This is the introduction to my final assignment for the Open Data Science course. I will be working on the student alcohol consumption data set.
#clear environment
rm(list=ls())
#read data
fpath<- "C:/Users/oyeda/Desktop/OPEN_DATA_SCIENCE/IODS-final/data/"
alco_data <- read.table(paste0(fpath, "alcohol_student.csv"), sep=",", header=T)
alco_data<- alco_data[,-1]
#necessary packages
#install.packages("dismo")
library(dplyr)
library(ggplot2)
library(corrplot)
library(GGally)
library(tidyr)
library(tidyverse)
library(gbm)
library(dismo)
library(caTools)
library(mgcv)
library(MASS)
library(FactoMineR)
My aim is to demonstrate the use of various statistical techniques for gaining insight into the dataset, in order to understand the patterns of alcohol consumption and absences amongst students in two schools. Further, I seek to explore the distribution of grades amongst male and female students, and the factors that might be affecting it. Firstly, I will use basic descriptive statistics to understand the distributions, correlations and dependencies (cor, barplot, histogram, biplot etc.). I will also be using regression models (Generalized Linear Model (GLM), Generalised Additive Model (GAM) and Generalised Boosted Model (GBM)). Furthermore, I will test my models using a 70:30 data split.
I will also explore Linear Discriminant Analysis (LDA) and Multiple Correspondence Analysis (MCA) to classify and categorise the data. My research questions are:
What are the factors considerably affecting students’ grades?
What are the factors causing absences amongst students?
What are the factors significantly causing high alcohol consumption amongst students?
categ = apply(alco_data, 2, function(x) nlevels(as.factor(x)))
categ
## school sex age address famsize Pstatus
## 2 2 7 2 2 2
## Medu Fedu Mjob Fjob reason nursery
## 5 5 5 5 4 2
## internet guardian traveltime studytime failures schoolsup
## 2 3 4 4 4 2
## famsup paid activities higher romantic famrel
## 2 2 2 2 2 5
## freetime goout Dalc Walc health absences
## 5 5 5 5 5 26
## G1 G2 G3 alc_use high_use
## 14 15 18 9 2
dim(alco_data)
## [1] 382 35
str(alco_data)
## 'data.frame': 382 obs. of 35 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ age : int 15 15 15 15 15 15 15 15 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 1 1 1 1 1 1 1 1 1 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 1 1 1 1 1 2 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 2 2 2 2 2 2 2 2 2 1 ...
## $ Medu : int 1 1 2 2 3 3 3 2 3 3 ...
## $ Fedu : int 1 1 2 4 3 4 4 2 1 3 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 3 1 4 4 4 4 2 3 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 3 3 3 2 4 2 5 4 3 2 ...
## $ reason : Factor w/ 4 levels "course","home",..: 2 4 4 1 4 1 1 4 4 4 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 1 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 2 1 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 2 2 2 3 2 1 2 1 1 ...
## $ traveltime: int 2 1 1 1 2 1 2 2 2 1 ...
## $ studytime : int 4 2 1 3 3 3 3 2 4 4 ...
## $ failures : int 0 1 0 0 1 0 1 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 1 2 1 2 ...
## $ famsup : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 1 ...
## $ paid : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 2 1 1 ...
## $ activities: Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 1 1 1 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 2 1 1 1 ...
## $ famrel : int 3 3 4 4 4 4 4 4 4 4 ...
## $ freetime : int 1 3 3 3 2 3 2 1 4 3 ...
## $ goout : int 2 4 1 2 1 2 2 3 2 3 ...
## $ Dalc : int 1 2 1 1 2 1 2 1 2 1 ...
## $ Walc : int 1 4 1 1 3 1 2 3 3 1 ...
## $ health : int 1 5 2 5 3 5 5 4 3 4 ...
## $ absences : int 3 2 8 2 5 2 0 1 9 10 ...
## $ G1 : int 10 10 14 10 12 12 11 10 16 10 ...
## $ G2 : int 12 8 13 10 12 12 6 10 16 10 ...
## $ G3 : int 12 8 12 9 12 12 6 10 16 10 ...
## $ alc_use : num 1 3 1 1 2.5 1 2 2 2.5 1 ...
## $ high_use : logi FALSE TRUE FALSE FALSE TRUE FALSE ...
summary(alco_data)
## school sex age address famsize Pstatus
## GP:342 F:198 Min. :15.00 R: 81 GT3:278 A: 38
## MS: 40 M:184 1st Qu.:16.00 U:301 LE3:104 T:344
## Median :17.00
## Mean :16.59
## 3rd Qu.:17.00
## Max. :22.00
## Medu Fedu Mjob Fjob
## Min. :0.000 Min. :0.000 at_home : 53 at_home : 16
## 1st Qu.:2.000 1st Qu.:2.000 health : 33 health : 17
## Median :3.000 Median :3.000 other :138 other :211
## Mean :2.806 Mean :2.565 services: 96 services:107
## 3rd Qu.:4.000 3rd Qu.:4.000 teacher : 62 teacher : 31
## Max. :4.000 Max. :4.000
## reason nursery internet guardian traveltime
## course :140 no : 72 no : 58 father: 91 Min. :1.000
## home :110 yes:310 yes:324 mother:275 1st Qu.:1.000
## other : 34 other : 16 Median :1.000
## reputation: 98 Mean :1.448
## 3rd Qu.:2.000
## Max. :4.000
## studytime failures schoolsup famsup paid activities
## Min. :1.000 Min. :0.0000 no :331 no :144 no :205 no :181
## 1st Qu.:1.000 1st Qu.:0.0000 yes: 51 yes:238 yes:177 yes:201
## Median :2.000 Median :0.0000
## Mean :2.037 Mean :0.2016
## 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :3.0000
## higher romantic famrel freetime goout
## no : 18 no :261 Min. :1.000 Min. :1.00 Min. :1.000
## yes:364 yes:121 1st Qu.:4.000 1st Qu.:3.00 1st Qu.:2.000
## Median :4.000 Median :3.00 Median :3.000
## Mean :3.937 Mean :3.22 Mean :3.113
## 3rd Qu.:5.000 3rd Qu.:4.00 3rd Qu.:4.000
## Max. :5.000 Max. :5.00 Max. :5.000
## Dalc Walc health absences
## Min. :1.000 Min. :1.000 Min. :1.000 Min. : 0.0
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:3.000 1st Qu.: 1.0
## Median :1.000 Median :2.000 Median :4.000 Median : 3.0
## Mean :1.482 Mean :2.296 Mean :3.573 Mean : 4.5
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.: 6.0
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :45.0
## G1 G2 G3 alc_use
## Min. : 2.00 Min. : 4.00 Min. : 0.00 Min. :1.000
## 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:1.000
## Median :12.00 Median :12.00 Median :12.00 Median :1.500
## Mean :11.49 Mean :11.47 Mean :11.46 Mean :1.889
## 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:2.500
## Max. :18.00 Max. :18.00 Max. :18.00 Max. :5.000
## high_use
## Mode :logical
## FALSE:268
## TRUE :114
##
##
##
#glimpse(alco_data)
# ####Grade:
#bar plot of the variables
gather(alco_data) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
To avoid confusion, I will copy the data into a new data frame for the MCA. Firstly, I will extract the categorical variables as factors.
#To avoid, confusion, I will copy the data into a new data frame for MCA
data_mca <- alco_data
#Firstly, I will be extracting the categorical variables, as factors
#now, filter them out for the MCA
data_mca_fac1<-Filter(is.factor, data_mca)
high_use<-factor(alco_data[,"high_use"])
data_mca_fac1 <- cbind.data.frame(data_mca_fac1, high_use)
#par(mfrow=c(1,2))
#now, perform the MCA on the categorical/qualitative variables
mca_alco1 <- MCA(data_mca_fac1, graph = T)
#plot it
plot(mca_alco1, invisible=c("ind"), habillage="quali")
#par(mfrow=c(1,1))
# Getting a better biplot for MCA using ggplot
## number of categories per variable
categ = apply(data_mca_fac1, 2, function(x) nlevels(as.factor(x)))
categ
## school sex address famsize Pstatus Mjob
## 2 2 2 2 2 5
## Fjob reason nursery internet guardian schoolsup
## 5 4 2 2 3 2
## famsup paid activities higher romantic high_use
## 2 2 2 2 2 2
# data frame with variable coordinates
mca_vars_df = data.frame(mca_alco1$var$coord, Variable = rep(names(categ), categ))
#mca_vars_df
# data frame with observation coordinates
mca_obs_df = data.frame(mca_alco1$ind$coord)
# MCA plot of observations and categories
ggplot(data = mca_obs_df, aes(x = Dim.1, y = Dim.2)) +
geom_hline(yintercept = 0, colour = "gray70") +
geom_vline(xintercept = 0, colour = "gray70") +
geom_point(colour = "gray50", alpha = 0.0) +
geom_density2d(colour = "gray80") +
geom_text(data = mca_vars_df,
aes(x = Dim.1, y = Dim.2,
label = rownames(mca_vars_df), colour = Variable)) +
ggtitle("MCA plot of variables using R package FactoMineR") +
scale_colour_discrete(name = "Variable")
Here, we can see that female students generally attend the school because of its reputation and get more school support. They also have family support and are able to attend paid extra classes. Most of their fathers' jobs are in the health sector. High alcohol consumption appears more common amongst male students than their female counterparts. Male students also attend the school based on course preference and other reasons. By and large, they do not attend paid extra classes compared to the female students, and also do not get family and school support like the female students. Lastly, they mostly have no intention of pursuing higher education. This is explored further by categorising continuous variables such as grades and absences, to allow for comparison with the other variables.
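The associations read off the biplot can also be checked numerically. As a sketch (using the `mca_alco1` object fitted above), the contribution table and FactoMineR's `dimdesc()` list the categories and variables most strongly tied to each dimension:

```r
#categories' percentage contributions to the first two MCA dimensions,
#sorted by their contribution to dimension 1
contrib <- mca_alco1$var$contrib
head(contrib[order(-contrib[, 1]), 1:2], 10)
#variables most associated with dimensions 1 and 2
dimdesc(mca_alco1, axes = 1:2)
```

Categories with large contributions are the ones that anchor the visual clusters described above.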
Next, I will categorise grade (G3), absences, mother's education (Medu) and father's education (Fedu).
#Next, I will be categorising grade(G3), absences, Mother's education(Medu),
#father's education
data_mca2<- alco_data
#now, convert mother and father's education level into something more understandable.
#data_mca$Medu<- factor(data_mca$Medu)
data_mca2$Medu<- factor(data_mca2$Medu, labels=c("no", "pr","5-9th", "sec", "hiE"))
data_mca2$Fedu<- factor(data_mca2$Fedu, labels=c("no", "pr","5-9th", "sec", "hiE"))
#Now, let's categorise grades according to quartiles
bins_abs<- quantile(data_mca2$absences)
data_mca2$absences<-cut(data_mca2$absences, breaks=bins_abs, include.lowest=T,
labels=c("vlow", "low", "high", "vhigh"))
#same to grade(G3)
bins_gra<- quantile(data_mca2$G3)
data_mca2$G3<-cut(data_mca2$G3, breaks=bins_gra, include.lowest=T,
labels=c("vlow", "low", "high", "vhigh"))
#Getting the columns that are factors
#since MCA is for categorical variables, I will filter out the factors,
#having already categorised the integer variables of interest
#(absences, G3, Medu, Fedu) above
#let's first see the variables already in factor format
names(Filter(is.factor, data_mca2))
## [1] "school" "sex" "address" "famsize" "Pstatus"
## [6] "Medu" "Fedu" "Mjob" "Fjob" "reason"
## [11] "nursery" "internet" "guardian" "schoolsup" "famsup"
## [16] "paid" "activities" "higher" "romantic" "absences"
## [21] "G3"
#now, filter them out for the MCA
data_mca_fac2_<-Filter(is.factor, data_mca2)
#include the high alcohol use column
high_use<-factor(alco_data[,"high_use"])
#join with the dataframe of categorical variables
data_mca_fac2_<-cbind.data.frame(data_mca_fac2_, high_use)
str(data_mca_fac2_)
## 'data.frame': 382 obs. of 22 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ address : Factor w/ 2 levels "R","U": 1 1 1 1 1 1 1 1 1 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 1 1 1 1 1 2 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 2 2 2 2 2 2 2 2 2 1 ...
## $ Medu : Factor w/ 5 levels "no","pr","5-9th",..: 2 2 3 3 4 4 4 3 4 4 ...
## $ Fedu : Factor w/ 5 levels "no","pr","5-9th",..: 2 2 3 3 4 4 4 3 4 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 3 1 4 4 4 4 2 3 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 3 3 3 2 4 2 5 4 3 2 ...
## $ reason : Factor w/ 4 levels "course","home",..: 2 4 4 1 4 1 1 4 4 4 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 1 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 2 1 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 2 2 2 3 2 1 2 1 1 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 1 2 1 2 ...
## $ famsup : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 1 ...
## $ paid : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 2 1 1 ...
## $ activities: Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 1 1 1 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 2 1 1 1 ...
## $ absences : Factor w/ 4 levels "vlow","low","high",..: 2 2 4 2 3 2 1 1 4 4 ...
## $ G3 : Factor w/ 4 levels "vlow","low","high",..: 2 1 2 1 2 2 1 1 4 1 ...
## $ high_use : Factor w/ 2 levels "FALSE","TRUE": 1 2 1 1 2 1 1 1 2 1 ...
#The above can also be done with dplyr which has been loaded already too
#data_mca %>% Filter(f = is.factor)
#names of the variables again ?
names(data_mca_fac2_)
## [1] "school" "sex" "address" "famsize" "Pstatus"
## [6] "Medu" "Fedu" "Mjob" "Fjob" "reason"
## [11] "nursery" "internet" "guardian" "schoolsup" "famsup"
## [16] "paid" "activities" "higher" "romantic" "absences"
## [21] "G3" "high_use"
#Alternative for finding the column name that are factors
# names(data_)[ sapply(data_reg, is.factor) ]
# #or
# data_reg %>% Filter(f = is.factor) %>% names
# or
# data_reg %>% summarise_each(funs(is.factor)) %>% unlist %>% .[.] %>% names
#I will select a few of the variables for clarity's sake and include sex too
data_mca_fac2<-data_mca_fac2_[,c("sex",names(data_mca_fac2_[,(10:ncol(data_mca_fac2_))]))]
#now, perform the MCA on the categorical/qualitative variables
mca_alco2 <- MCA(data_mca_fac2, graph = T)
#plot it
plot(mca_alco2, invisible=c("ind"), habillage="quali")
# Getting a better biplot for MCA using ggplot
## number of categories per variable
categ2 = apply(data_mca_fac2, 2, function(x) nlevels(as.factor(x)))
categ2
## sex reason nursery internet guardian schoolsup
## 2 4 2 2 3 2
## famsup paid activities higher romantic absences
## 2 2 2 2 2 4
## G3 high_use
## 4 2
# data frame with variable coordinates
mca_vars_df2 = data.frame(mca_alco2$var$coord, Variable = rep(names(categ2), categ2))
#mca_vars_df2
# data frame with observation coordinates
mca_obs_df2 = data.frame(mca_alco2$ind$coord)
# MCA plot of observations and categories
ggplot(data = mca_obs_df2, aes(x = Dim.1, y = Dim.2)) +
geom_hline(yintercept = 0, colour = "gray70") +
geom_vline(xintercept = 0, colour = "gray70") +
geom_point(colour = "gray50", alpha = 0.1) +
geom_density2d(colour = "gray80") +
geom_text(data = mca_vars_df2,
aes(x = Dim.1, y = Dim.2,
label = rownames(mca_vars_df2), colour = Variable)) +
ggtitle("MCA plot of variables") +
scale_colour_discrete(name = "Variable")
Here, we can see the categorisations and associations. The NW quadrant shows that those who pay for extra classes in Portuguese and Maths have family support and are mostly female students. They also get extra educational support and have ambition for higher education.
In the NE quadrant, high alcohol use seems to be associated with having a guardian other than the mother or father. Those with high alcohol use also seem not to have ambition for higher education. They perform poorly in their studies (low grades) and mostly choose the school because it is closer to home, amongst other reasons, rather than for reputation or course preference. They also do not seem to participate in extracurricular activities, are mostly in romantic relationships, and have high absences.
In the SW quadrant, it can be seen that those with low absences perform very well in their studies and do not take excessive alcohol. They are also engaged in extracurricular activities and have their mother as their guardian. They are also motivated to attend the school because of its reputation.
In the SE quadrant, male students seem to be moderately absent and do not attend paid extra classes in Portuguese and Maths. They mostly attend the school because of course preference. They generally have no internet access and mostly do not get the school's support. It also looks like their father is their guardian.
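The quadrant reading above is only visual, so as a minimal sanity check (assuming `data_mca_fac2` from the chunk above), a couple of the suggested associations can be cross-tabulated:

```r
#row-wise proportions: share of high alcohol use within each sex
prop.table(table(data_mca_fac2$sex, data_mca_fac2$high_use), margin = 1)
#and within each level of higher-education ambition
prop.table(table(data_mca_fac2$higher, data_mca_fac2$high_use), margin = 1)
```

If the proportions differ clearly between rows, the corresponding proximity in the biplot is supported by the raw counts.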
This will be explored further with a family of regression models (GLM, GAM, GBM).
Before I proceed, there is a need to determine the error distribution of the response variables. As in many realistic cases, we do not always have a normal (i.e. Gaussian) distribution; we could have a Poisson distribution for count data, or a binomial distribution (e.g. yes/no, presence/absence). The Poisson distribution assumes the variance equals the mean, so when the variance falls below the mean a Poisson model is a poor fit. Therefore, I will create a function to test whether the variance is at least as large as the mean, and use it to decide the kind of distribution.
#Create function to test variance>mean assumption of poisson distribution
test.var.mn<- function(x){
if(var(x)>mean(x)){
print("VALID:The variance>mean assumption of poisson distribution is valid")
}
else{
print("INVALID:The variance>mean assumption of poisson distribution is invalid")
}
}
#see if Grade(G3) meets the assumption
test.var.mn(alco_data$G3)
## [1] "INVALID:The variance>mean assumption of poisson distribution is invalid"
#next, see if "absences" meets the assumption
test.var.mn(alco_data$absences)
## [1] "VALID:The variance>mean assumption of poisson distribution is valid"
The test shows that the assumption is invalid for grade (G3) but valid for absences. Next, therefore, I will explore various regression models, with grade (G3) as Gaussian, absences as Poisson and high alcohol use (high_use) as binomial.
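The same check can be expressed as a dispersion ratio (variance over mean): values near 1 are consistent with a Poisson model, values below 1 indicate underdispersion, and values well above 1 suggest overdispersion. A quick sketch on the two candidate responses:

```r
#dispersion ratio: ~1 fits Poisson, <1 underdispersed, >1 overdispersed
var(alco_data$G3) / mean(alco_data$G3)
var(alco_data$absences) / mean(alco_data$absences)
```

A ratio far above 1 for absences would argue for a quasi-Poisson or negative binomial model, though here the plain Poisson is retained for simplicity.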
As done earlier, the data will be copied into a new dataframe for all the regression analysis, to avoid confusion and alteration of the original data.
For GLM, predictors will be selected by first using stepwise regression to eliminate the redundant variables. Afterwards, I will employ the significance tests from the model summary, an anova test (Chi-square), and AIC values. I also checked whether any of the predictors is curvilinear (i.e. has a higher-order polynomial term).
#As done earlier, the data will be copied into a new dataframe for
#all the regression analysis, to avoid confusion and alteration of
#the original data
data_reg <- alco_data
#for GLM, predictors were selected by employing the significance tests from the model
#summary, anova test (Chi-square), AIC values and stepwise regression. I also checked
#if any of the predictors is curvilinear (i.e. has a higher-order polynomial term)
grade_glm<-glm(G3~ sex + age+address+Medu+Fedu+
Pstatus+ traveltime+studytime+famsup+activities+higher+
internet+famrel+romantic+freetime+goout+ alc_use+ absences
,data=data_reg,family ="gaussian")
#First, I used the stepwise regression to eliminate the redundant variables.
stepAIC(grade_glm, direction = "both")
## Start: AIC=1950.36
## G3 ~ sex + age + address + Medu + Fedu + Pstatus + traveltime +
## studytime + famsup + activities + higher + internet + famrel +
## romantic + freetime + goout + alc_use + absences
##
## Df Deviance AIC
## - freetime 1 3322.6 1948.4
## - activities 1 3322.9 1948.4
## - Fedu 1 3325.3 1948.7
## - famrel 1 3326.5 1948.8
## - alc_use 1 3326.9 1948.9
## - age 1 3328.8 1949.1
## - absences 1 3330.2 1949.2
## - sex 1 3330.9 1949.3
## - Pstatus 1 3332.2 1949.5
## - traveltime 1 3336.1 1949.9
## - famsup 1 3336.2 1949.9
## <none> 3322.6 1950.4
## - internet 1 3340.6 1950.4
## - address 1 3342.1 1950.6
## - romantic 1 3344.4 1950.9
## - goout 1 3361.7 1952.8
## - Medu 1 3365.9 1953.3
## - studytime 1 3380.7 1955.0
## - higher 1 3494.9 1967.7
##
## Step: AIC=1948.37
## G3 ~ sex + age + address + Medu + Fedu + Pstatus + traveltime +
## studytime + famsup + activities + higher + internet + famrel +
## romantic + goout + alc_use + absences
##
## Df Deviance AIC
## - activities 1 3322.9 1946.4
## - Fedu 1 3325.3 1946.7
## - famrel 1 3326.7 1946.8
## - alc_use 1 3326.9 1946.9
## - age 1 3328.9 1947.1
## - absences 1 3330.4 1947.3
## - sex 1 3331.3 1947.4
## - Pstatus 1 3332.2 1947.5
## - famsup 1 3336.2 1947.9
## - traveltime 1 3336.3 1947.9
## <none> 3322.6 1948.4
## - internet 1 3340.8 1948.5
## - address 1 3342.1 1948.6
## - romantic 1 3344.4 1948.9
## + freetime 1 3322.6 1950.4
## - goout 1 3363.4 1951.0
## - Medu 1 3366.0 1951.3
## - studytime 1 3381.0 1953.0
## - higher 1 3495.0 1965.7
##
## Step: AIC=1946.4
## G3 ~ sex + age + address + Medu + Fedu + Pstatus + traveltime +
## studytime + famsup + higher + internet + famrel + romantic +
## goout + alc_use + absences
##
## Df Deviance AIC
## - Fedu 1 3325.7 1944.7
## - famrel 1 3327.1 1944.9
## - alc_use 1 3327.5 1944.9
## - age 1 3329.4 1945.1
## - absences 1 3330.6 1945.3
## - Pstatus 1 3332.3 1945.5
## - sex 1 3332.3 1945.5
## - traveltime 1 3336.5 1946.0
## - famsup 1 3336.9 1946.0
## <none> 3322.9 1946.4
## - internet 1 3341.3 1946.5
## - address 1 3342.2 1946.6
## - romantic 1 3344.5 1946.9
## + activities 1 3322.6 1948.4
## + freetime 1 3322.9 1948.4
## - goout 1 3363.4 1949.0
## - Medu 1 3366.6 1949.4
## - studytime 1 3382.8 1951.2
## - higher 1 3498.6 1964.1
##
## Step: AIC=1944.72
## G3 ~ sex + age + address + Medu + Pstatus + traveltime + studytime +
## famsup + higher + internet + famrel + romantic + goout +
## alc_use + absences
##
## Df Deviance AIC
## - famrel 1 3330.0 1943.2
## - alc_use 1 3330.1 1943.2
## - age 1 3332.5 1943.5
## - absences 1 3333.7 1943.6
## - sex 1 3334.9 1943.8
## - Pstatus 1 3335.1 1943.8
## - famsup 1 3338.6 1944.2
## - traveltime 1 3340.6 1944.4
## <none> 3325.7 1944.7
## - internet 1 3344.2 1944.8
## - address 1 3344.2 1944.8
## - romantic 1 3347.0 1945.2
## + Fedu 1 3322.9 1946.4
## + activities 1 3325.3 1946.7
## + freetime 1 3325.7 1946.7
## - goout 1 3366.3 1947.3
## - studytime 1 3383.9 1949.3
## - Medu 1 3417.6 1953.1
## - higher 1 3506.2 1962.9
##
## Step: AIC=1943.22
## G3 ~ sex + age + address + Medu + Pstatus + traveltime + studytime +
## famsup + higher + internet + romantic + goout + alc_use +
## absences
##
## Df Deviance AIC
## - age 1 3336.1 1941.9
## - alc_use 1 3336.2 1941.9
## - absences 1 3338.3 1942.2
## - Pstatus 1 3339.0 1942.2
## - sex 1 3340.6 1942.4
## - famsup 1 3342.9 1942.7
## - traveltime 1 3344.9 1942.9
## <none> 3330.0 1943.2
## - address 1 3348.3 1943.3
## - internet 1 3349.9 1943.5
## - romantic 1 3352.5 1943.8
## + famrel 1 3325.7 1944.7
## + Fedu 1 3327.1 1944.9
## + activities 1 3329.5 1945.2
## + freetime 1 3329.8 1945.2
## - goout 1 3368.2 1945.6
## - studytime 1 3388.6 1947.9
## - Medu 1 3421.8 1951.6
## - higher 1 3512.7 1961.6
##
## Step: AIC=1941.92
## G3 ~ sex + address + Medu + Pstatus + traveltime + studytime +
## famsup + higher + internet + romantic + goout + alc_use +
## absences
##
## Df Deviance AIC
## - alc_use 1 3343.3 1940.7
## - absences 1 3345.8 1941.0
## - Pstatus 1 3345.9 1941.0
## - famsup 1 3347.2 1941.2
## - sex 1 3347.8 1941.2
## - traveltime 1 3351.6 1941.7
## <none> 3336.1 1941.9
## - address 1 3357.0 1942.3
## - internet 1 3357.8 1942.4
## - romantic 1 3361.2 1942.8
## + age 1 3330.0 1943.2
## + famrel 1 3332.5 1943.5
## + Fedu 1 3332.9 1943.5
## + activities 1 3335.4 1943.8
## + freetime 1 3335.9 1943.9
## - goout 1 3379.0 1944.8
## - studytime 1 3392.7 1946.3
## - Medu 1 3430.7 1950.6
## - higher 1 3538.4 1962.4
##
## Step: AIC=1940.74
## G3 ~ sex + address + Medu + Pstatus + traveltime + studytime +
## famsup + higher + internet + romantic + goout + absences
##
## Df Deviance AIC
## - sex 1 3351.5 1939.7
## - Pstatus 1 3353.5 1939.9
## - famsup 1 3354.4 1940.0
## - absences 1 3356.7 1940.3
## <none> 3343.3 1940.7
## - traveltime 1 3360.9 1940.8
## - internet 1 3364.2 1941.1
## - address 1 3367.1 1941.5
## - romantic 1 3369.4 1941.7
## + alc_use 1 3336.1 1941.9
## + age 1 3336.2 1941.9
## + famrel 1 3337.8 1942.1
## + Fedu 1 3340.3 1942.4
## + activities 1 3342.0 1942.6
## + freetime 1 3343.1 1942.7
## - studytime 1 3407.7 1946.0
## - goout 1 3409.5 1946.2
## - Medu 1 3438.8 1949.5
## - higher 1 3544.7 1961.1
##
## Step: AIC=1939.67
## G3 ~ address + Medu + Pstatus + traveltime + studytime + famsup +
## higher + internet + romantic + goout + absences
##
## Df Deviance AIC
## - Pstatus 1 3361.3 1938.8
## - famsup 1 3365.7 1939.3
## - absences 1 3367.6 1939.5
## - traveltime 1 3367.9 1939.5
## <none> 3351.5 1939.7
## - address 1 3374.2 1940.2
## - internet 1 3374.9 1940.3
## + sex 1 3343.3 1940.7
## + age 1 3343.7 1940.8
## - romantic 1 3380.3 1941.0
## + famrel 1 3345.2 1941.0
## + alc_use 1 3347.8 1941.2
## + Fedu 1 3348.6 1941.3
## + activities 1 3349.3 1941.4
## + freetime 1 3350.6 1941.6
## - studytime 1 3408.3 1944.1
## - goout 1 3416.2 1945.0
## - Medu 1 3460.2 1949.9
## - higher 1 3545.5 1959.2
##
## Step: AIC=1938.78
## G3 ~ address + Medu + traveltime + studytime + famsup + higher +
## internet + romantic + goout + absences
##
## Df Deviance AIC
## - absences 1 3374.4 1938.3
## - famsup 1 3376.2 1938.5
## - traveltime 1 3377.5 1938.6
## <none> 3361.3 1938.8
## - internet 1 3382.6 1939.2
## - address 1 3386.2 1939.6
## + Pstatus 1 3351.5 1939.7
## + age 1 3352.6 1939.8
## - romantic 1 3388.5 1939.9
## + sex 1 3353.5 1939.9
## + famrel 1 3355.4 1940.1
## + alc_use 1 3357.1 1940.3
## + Fedu 1 3358.3 1940.5
## + activities 1 3359.9 1940.6
## + freetime 1 3360.6 1940.7
## - studytime 1 3416.9 1943.0
## - goout 1 3428.1 1944.3
## - Medu 1 3479.9 1950.0
## - higher 1 3558.4 1958.6
##
## Step: AIC=1938.27
## G3 ~ address + Medu + traveltime + studytime + famsup + higher +
## internet + romantic + goout
##
## Df Deviance AIC
## - famsup 1 3390.0 1938.0
## - traveltime 1 3390.0 1938.0
## <none> 3374.4 1938.3
## - internet 1 3393.4 1938.4
## + absences 1 3361.3 1938.8
## + age 1 3363.4 1939.0
## + sex 1 3364.2 1939.1
## - address 1 3402.2 1939.4
## + famrel 1 3367.5 1939.5
## + Pstatus 1 3367.6 1939.5
## + alc_use 1 3367.8 1939.5
## - romantic 1 3405.0 1939.7
## + Fedu 1 3371.1 1939.9
## + activities 1 3373.0 1940.1
## + freetime 1 3373.2 1940.1
## - studytime 1 3437.3 1943.3
## - goout 1 3447.8 1944.5
## - Medu 1 3487.6 1948.9
## - higher 1 3578.7 1958.7
##
## Step: AIC=1938.04
## G3 ~ address + Medu + traveltime + studytime + higher + internet +
## romantic + goout
##
## Df Deviance AIC
## - traveltime 1 3406.4 1937.9
## - internet 1 3407.3 1938.0
## <none> 3390.0 1938.0
## + famsup 1 3374.4 1938.3
## + sex 1 3376.1 1938.5
## + absences 1 3376.2 1938.5
## + age 1 3381.6 1939.1
## - address 1 3418.3 1939.2
## + Pstatus 1 3382.7 1939.2
## + famrel 1 3382.7 1939.2
## + alc_use 1 3384.0 1939.4
## - romantic 1 3421.0 1939.5
## + activities 1 3387.9 1939.8
## + Fedu 1 3388.0 1939.8
## + freetime 1 3388.9 1939.9
## - studytime 1 3445.9 1942.3
## - goout 1 3462.1 1944.1
## - Medu 1 3494.4 1947.6
## - higher 1 3589.9 1957.9
##
## Step: AIC=1937.88
## G3 ~ address + Medu + studytime + higher + internet + romantic +
## goout
##
## Df Deviance AIC
## - internet 1 3424.1 1937.9
## <none> 3406.4 1937.9
## + traveltime 1 3390.0 1938.0
## + famsup 1 3390.0 1938.0
## + absences 1 3393.3 1938.4
## + sex 1 3394.2 1938.5
## + age 1 3397.3 1938.9
## + alc_use 1 3398.4 1939.0
## + famrel 1 3399.1 1939.1
## + Pstatus 1 3399.1 1939.1
## - romantic 1 3437.4 1939.3
## + Fedu 1 3403.3 1939.5
## + activities 1 3404.4 1939.7
## + freetime 1 3405.0 1939.7
## - address 1 3455.4 1941.3
## - studytime 1 3469.6 1942.9
## - goout 1 3484.4 1944.5
## - Medu 1 3528.0 1949.3
## - higher 1 3605.9 1957.6
##
## Step: AIC=1937.86
## G3 ~ address + Medu + studytime + higher + romantic + goout
##
## Df Deviance AIC
## <none> 3424.1 1937.9
## + internet 1 3406.4 1937.9
## + traveltime 1 3407.3 1938.0
## + famsup 1 3409.4 1938.2
## + sex 1 3409.7 1938.2
## + age 1 3413.2 1938.6
## + absences 1 3413.2 1938.7
## + famrel 1 3415.2 1938.9
## - romantic 1 3452.0 1939.0
## + alc_use 1 3417.5 1939.1
## + Pstatus 1 3418.2 1939.2
## + Fedu 1 3420.7 1939.5
## + activities 1 3421.3 1939.5
## + freetime 1 3421.8 1939.6
## - address 1 3486.8 1942.8
## - studytime 1 3490.8 1943.2
## - goout 1 3497.2 1943.9
## - Medu 1 3565.4 1951.3
## - higher 1 3618.2 1956.9
##
## Call: glm(formula = G3 ~ address + Medu + studytime + higher + romantic +
## goout, family = "gaussian", data = data_reg)
##
## Coefficients:
## (Intercept) addressU Medu studytime higheryes
## 6.0992 1.0045 0.5743 0.5093 3.5013
## romanticyes goout
## -0.5891 -0.3955
##
## Degrees of Freedom: 381 Total (i.e. Null); 375 Residual
## Null Deviance: 4175
## Residual Deviance: 3424 AIC: 1938
#I got this: glm(G3 ~ address + Medu + studytime + higher + romantic +
#goout, data=data_reg, family="gaussian")
#Thereafter, the anova Chi-square test and the t-test from the model summary
#were used to further eliminate the insignificant variables at level 0.05.
#After this, "romantic" fell slightly short of the level and was hence removed
grade_glm<-glm(G3~ address + Medu + studytime + higher +
goout,data=data_reg,family ="gaussian")
#Anova test
anova(grade_glm, test="Chisq")
## Analysis of Deviance Table
##
## Model: gaussian, link: identity
##
## Response: G3
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 381 4174.8
## address 1 88.385 380 4086.4 0.0019172 **
## Medu 1 193.900 379 3892.5 4.314e-06 ***
## studytime 1 122.115 378 3770.4 0.0002652 ***
## higher 1 245.311 377 3525.1 2.352e-07 ***
## goout 1 73.150 376 3452.0 0.0047619 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The anova Chi-square test and the t-test from the model summary were used to further eliminate the insignificant variables at level 0.05.
Here, we can see that the significant factors affecting students' grades include address, mother's level of education, study time, ambition to pursue higher education, and going out. Mother's education and higher-education ambition seem to be the most significant factors. However, mother's education level is not curvilinear (i.e. a higher-order polynomial term, "I(Medu^2)", does not improve the model).
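The curvilinearity check mentioned above can be sketched by refitting the model with a quadratic term for Medu and comparing the two fits (model and data names as in the chunks above; this is an illustrative sketch rather than the original code):

```r
#add a quadratic Medu term and test whether it improves the fit
grade_glm_quad <- glm(G3 ~ address + Medu + I(Medu^2) + studytime + higher + goout,
                      data = data_reg, family = "gaussian")
anova(grade_glm, grade_glm_quad, test = "Chisq")
```

A non-significant p-value for the added term supports keeping Medu linear.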
In accordance with the principle of parsimony, I decided to use variables with a significance level of 0.05. Therefore, the final model is: grade_glm <- glm(G3 ~ address + Medu + studytime + higher + goout, data=data_reg, family="gaussian")
Let's also see if any observation has leverage, further test the normality assumption, and explore how the residuals differ from the fitted values.
par(mfrow=c(2,2))
plot(grade_glm, which = c(1,2,5))
From this, the constant-variance assumption for the errors seems to hold. The distribution also appears normal from the Normal Q-Q plot, although with a slight divergence from the line at the lower end. Also, no observation seems to have clear leverage over the others, aside from one point lying rather far from the rest. Nevertheless, a Gaussian distribution can safely be used for modelling the grade (G3) response variable.
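The visual impression from the diagnostic plots can be backed up numerically; a small sketch (using the `grade_glm` fitted above) pulls out the largest Cook's distances and leverages:

```r
#observations with the largest influence (Cook's distance) and leverage
head(sort(cooks.distance(grade_glm), decreasing = TRUE), 3)
head(sort(hatvalues(grade_glm), decreasing = TRUE), 3)
```

If no Cook's distance approaches 1, no single observation is driving the fit.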
Create functions to calculate the mean error and the root mean square error (RMSE).
# function to calculate the mean absolute and RMSE
#function to calculate mean error
mean_error<- function(obs, pred){
me<-mean(abs(obs-pred))
return(me)
}
# Function that returns Root Mean Squared Error
rmse <- function(obs, pred){
rmse<-sqrt(mean((obs-pred)^2))
return(rmse)
}
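As a quick check that the two helpers behave as intended, here is a toy example (illustrative values only):

```r
obs  <- c(10, 12, 14)
pred <- c(11, 12, 12)
mean_error(obs, pred)  #mean(|obs - pred|) = (1 + 0 + 2)/3 = 1
rmse(obs, pred)        #sqrt((1 + 0 + 4)/3), approx. 1.29
```

Because RMSE squares the errors, it penalises the single large miss (14 vs 12) more heavily than the mean absolute error does.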
This part is for the model validation. The data was divided into 70% training/calibration data and 30% testing/evaluation data, and the random split was repeated 10 times. The results of the models and the response curves are presented thereafter.
#First copy, the original data into new dataframe, to avoid confusion
data_reg<-alco_data
{rep<-10
for (i in 1:rep){
#print the index to see the iteration
print(i)
#Create a random 70% sample (without replacement) of row indices from the original data
rand<- sample(1:nrow(data_reg), size = 0.7*nrow(data_reg))
#70% for the train/calibration data
cal<- data_reg[rand,]
#remaining 30% for the test/evaluation data
eva<-data_reg[-rand,]
####GLM
#fit a Generalised Linear Model (GLM)
grade_glm <- glm(G3~address + Medu + studytime + higher +
goout, data=cal, family = "gaussian")
#predict into the test/evaluation data
grade_glm_pred<- predict.glm(object = grade_glm, newdata = eva, type="response")
#correlation between the predictions and the observed test values
cor_glm_grade<-cor(grade_glm_pred, eva$G3, method = "spearman")
#########
#mean error and root mean square error
#calculate the mean error
error_grade_glm<- cbind.data.frame(grade_glm_pred, eva$G3)
colnames(error_grade_glm) <- c("pred_glm_grade", "obs_grade")
#Use the functions created earlier to calculate the mean error and RMSE.
#Mean error
grade_glm_me <- mean_error(error_grade_glm$obs_grade, error_grade_glm$pred_glm_grade)
#RMSE
grade_glm_rmse <- rmse(error_grade_glm$obs_grade, error_grade_glm$pred_glm_grade)
#combine the dataframe of the mean error and RMSE
me_rmse_grade_glm <- rbind.data.frame(grade_glm_me, grade_glm_rmse)
#Change the column name to something more descriptive.
colnames(me_rmse_grade_glm)<- c("grade_glm")
#GAM
grade_gam <- gam(G3~ s(Medu, k=3) + higher + address + s(studytime, k=3) +
goout, data = cal, family = "gaussian")
grade_gam_pred <- predict.gam(grade_gam, newdata = eva, type = "response")
obs_pred_grade_gam<- cbind.data.frame(grade_gam_pred, eva$G3)
colnames(obs_pred_grade_gam) <- c("pred_gam_grade", "obs_gam_grade")
#you can just calculate the correlation straight away
cor_gam_grade <- cor(grade_gam_pred, eva$G3, method = "spearman")
#########
#mean error and root mean square error
error_grade_gam<- cbind.data.frame(grade_gam_pred, eva$G3)
colnames(error_grade_gam) <- c("pred_gam_grade", "obs_grade")
grade_gam_me <- mean_error(error_grade_gam$obs_grade, error_grade_gam$pred_gam_grade)
grade_gam_rmse <- rmse(error_grade_gam$obs_grade, error_grade_gam$pred_gam_grade)
me_rmse_grade_gam <- rbind.data.frame(grade_gam_me, grade_gam_rmse)
colnames(me_rmse_grade_gam)<- c("grade_gam")
###################################################################
#using the standard gbm package
#GBM
grade_gbm1<-gbm(formula = G3~ sex + age+address+Medu+Fedu+
Pstatus+ traveltime+studytime+famsup+activities+higher+
internet+famrel+romantic+freetime+goout+ alc_use+ absences, data=cal,
distribution = "gaussian",n.trees = 1000, shrinkage = 0.001, interaction.depth = 6,
bag.fraction = 0.75)
best.iter<-gbm.perf(grade_gbm1, plot.it = F, method = "OOB")
grade_gbm1_pred<- predict.gbm(object = grade_gbm1, newdata = eva, best.iter,
type="response")
cor_gbm1_grade <- cor(grade_gbm1_pred, eva$G3, method = "spearman")
#########
#mean error and root mean square error
error_grade_gbm1<- cbind.data.frame(grade_gbm1_pred, eva$G3)
colnames(error_grade_gbm1) <- c("pred_gbm1_grade", "obs_grade")
grade_gbm1_me <- mean_error(error_grade_gbm1$obs_grade, error_grade_gbm1$pred_gbm1_grade)
grade_gbm1_rmse <- rmse(error_grade_gbm1$obs_grade, error_grade_gbm1$pred_gbm1_grade)
me_rmse_grade_gbm1 <- rbind.data.frame(grade_gbm1_me, grade_gbm1_rmse)
colnames(me_rmse_grade_gbm1)<- c("grade_gbm1")
###################################################
#GBM(dismo package)
grade_gbm2 <- gbm.step(data=cal, gbm.x =c("sex", "age","address","Medu"
,"Fedu","Pstatus", "traveltime","studytime","famsup","activities","higher",
"internet","famrel","romantic","freetime","goout", "alc_use", "absences"), gbm.y = "G3",
bag.fraction=0.75, learning.rate = 0.001,
family="gaussian",n.trees=50, n.folds=10,
max.trees = 1000, tree.complexity = 6)
#This also works but can be done directly as shown in the prediction after this
#best.iter2<-gbm.perf(grade_gbm1, plot.it = T, method = "OOB")
# grade_gbm2_pred<- predict.gbm(object = grade_gbm2, newdata = eva, best.iter2,
# type="response")
#
#the prediction
grade_gbm2_pred <- predict.gbm(grade_gbm2, newdata = eva, n.trees=grade_gbm2$n.trees, type = "response")
cor_gbm2_grade <- cor(grade_gbm2_pred, eva$G3, method = "spearman")
#########
#mean error and root mean square error
error_grade_gbm2<- cbind.data.frame(grade_gbm2_pred, eva$G3)
colnames(error_grade_gbm2) <- c("pred_gbm2_grade", "obs_grade")
grade_gbm2_me <- mean_error(error_grade_gbm2$obs_grade, error_grade_gbm2$pred_gbm2_grade)
grade_gbm2_rmse <- rmse(error_grade_gbm2$obs_grade, error_grade_gbm2$pred_gbm2_grade)
me_rmse_grade_gbm2 <- rbind.data.frame(grade_gbm2_me, grade_gbm2_rmse)
colnames(me_rmse_grade_gbm2)<- c("grade_gbm2")
}
#####All correlation
all_cor_grade <- cbind.data.frame(cor_glm_grade,cor_gam_grade,
cor_gbm1_grade, cor_gbm2_grade)
colnames(all_cor_grade)<- c("grade_glm", "grade_gam", "grade_gbm1", "grade_gbm2")
#####all error
all_error_grade <- cbind.data.frame(me_rmse_grade_glm, me_rmse_grade_gam,
me_rmse_grade_gbm1, me_rmse_grade_gbm2)
rownames(all_error_grade)<- c("mean abs error", "RMSE")
}
## [1] 1
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for NA and using a family of gaussian
## Using 267 observations and 18 predictors
## creating 10 initial models of 50 trees
##
## folds are unstratified
## total mean deviance = 8.7939
## tolerance is fixed at 0.0088
## ntrees resid. dev.
## 50 8.8165
## now adding trees...
## 100 8.7886
## 150 8.7567
## 200 8.7297
## 250 8.7006
## 300 8.6819
## 350 8.6623
## 400 8.648
## 450 8.6332
## 500 8.621
## 550 8.6123
## 600 8.6023
## 650 8.5969
## 700 8.5886
## 750 8.5841
## 800 8.5754
## 850 8.5717
## 900 8.5703
## 950 8.5693
## 1000 8.5665
##
## mean total deviance = 8.794
## mean residual deviance = 7.012
##
## estimated cv deviance = 8.567 ; se = 0.594
##
## training data correlation = 0.603
## cv correlation = 0.178 ; se = 0.047
##
## elapsed time - 0.19 minutes
##
## ########### warning ##########
##
## maximum tree limit reached - results may not be optimal
## - refit with faster learning rate or increase maximum number of trees
## [1] 2
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for NA and using a family of gaussian
## Using 267 observations and 18 predictors
## creating 10 initial models of 50 trees
##
## folds are unstratified
## total mean deviance = 10.9353
## tolerance is fixed at 0.0109
## ntrees resid. dev.
## 50 10.9176
## now adding trees...
## 100 10.815
## 150 10.7127
## 200 10.6227
## 250 10.5419
## 300 10.4624
## 350 10.387
## 400 10.319
## 450 10.2563
## 500 10.2015
## 550 10.1449
## 600 10.0895
## 650 10.0426
## 700 9.9966
## 750 9.9549
## 800 9.9163
## 850 9.8814
## 900 9.8498
## 950 9.8182
## 1000 9.7973
##
## mean total deviance = 10.935
## mean residual deviance = 8.089
##
## estimated cv deviance = 9.797 ; se = 0.873
##
## training data correlation = 0.642
## cv correlation = 0.339 ; se = 0.045
##
## elapsed time - 0.19 minutes
##
## ########### warning ##########
##
## maximum tree limit reached - results may not be optimal
## - refit with faster learning rate or increase maximum number of trees
## [1] 3
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for NA and using a family of gaussian
## Using 267 observations and 18 predictors
## creating 10 initial models of 50 trees
##
## folds are unstratified
## total mean deviance = 10.1374
## tolerance is fixed at 0.0101
## ntrees resid. dev.
## 50 10.1127
## now adding trees...
## 100 10.037
## 150 9.9703
## 200 9.9164
## 250 9.8651
## 300 9.8181
## 350 9.7813
## 400 9.7415
## 450 9.7154
## 500 9.6829
## 550 9.6531
## 600 9.6348
## 650 9.6159
## 700 9.5954
## 750 9.5782
## 800 9.5624
## 850 9.5494
## 900 9.5392
## 950 9.5279
## 1000 9.517
##
## mean total deviance = 10.137
## mean residual deviance = 7.69
##
## estimated cv deviance = 9.517 ; se = 0.758
##
## training data correlation = 0.625
## cv correlation = 0.265 ; se = 0.043
##
## elapsed time - 0.19 minutes
##
## ########### warning ##########
##
## maximum tree limit reached - results may not be optimal
## - refit with faster learning rate or increase maximum number of trees
## [1] 4
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for NA and using a family of gaussian
## Using 267 observations and 18 predictors
## creating 10 initial models of 50 trees
##
## folds are unstratified
## total mean deviance = 10.5415
## tolerance is fixed at 0.0105
## ntrees resid. dev.
## 50 10.5096
## now adding trees...
## 100 10.438
## 150 10.3736
## 200 10.3146
## 250 10.2625
## 300 10.2099
## 350 10.1721
## 400 10.1377
## 450 10.1031
## 500 10.0721
## 550 10.043
## 600 10.0145
## 650 9.9933
## 700 9.9719
## 750 9.9531
## 800 9.9349
## 850 9.9194
## 900 9.9037
## 950 9.8879
## 1000 9.8695
##
## mean total deviance = 10.542
## mean residual deviance = 7.93
##
## estimated cv deviance = 9.869 ; se = 0.529
##
## training data correlation = 0.647
## cv correlation = 0.258 ; se = 0.046
##
## elapsed time - 0.19 minutes
##
## ########### warning ##########
##
## maximum tree limit reached - results may not be optimal
## - refit with faster learning rate or increase maximum number of trees
## [1] 5
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for NA and using a family of gaussian
## Using 267 observations and 18 predictors
## creating 10 initial models of 50 trees
##
## folds are unstratified
## total mean deviance = 10.2566
## tolerance is fixed at 0.0103
## ntrees resid. dev.
## 50 10.2809
## now adding trees...
## 100 10.2246
## 150 10.1745
## 200 10.1239
## 250 10.0746
## 300 10.0301
## 350 9.9894
## 400 9.9485
## 450 9.9196
## 500 9.8957
## 550 9.8639
## 600 9.8361
## 650 9.814
## 700 9.7971
## 750 9.7783
## 800 9.762
## 850 9.7545
## 900 9.7386
## 950 9.7219
## 1000 9.713
##
## mean total deviance = 10.257
## mean residual deviance = 7.828
##
## estimated cv deviance = 9.713 ; se = 0.645
##
## training data correlation = 0.634
## cv correlation = 0.238 ; se = 0.052
##
## elapsed time - 0.18 minutes
##
## ########### warning ##########
##
## maximum tree limit reached - results may not be optimal
## - refit with faster learning rate or increase maximum number of trees
## [1] 6
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for NA and using a family of gaussian
## Using 267 observations and 18 predictors
## creating 10 initial models of 50 trees
##
## folds are unstratified
## total mean deviance = 10.9166
## tolerance is fixed at 0.0109
## ntrees resid. dev.
## 50 10.877
## now adding trees...
## 100 10.7766
## 150 10.6818
## 200 10.5994
## 250 10.5231
## 300 10.4488
## 350 10.3802
## 400 10.3169
## 450 10.2573
## 500 10.2042
## 550 10.1592
## 600 10.1237
## 650 10.0839
## 700 10.042
## 750 10.0126
## 800 9.9767
## 850 9.9502
## 900 9.9316
## 950 9.9097
## 1000 9.89
##
## mean total deviance = 10.917
## mean residual deviance = 8.031
##
## estimated cv deviance = 9.89 ; se = 0.863
##
## training data correlation = 0.635
## cv correlation = 0.298 ; se = 0.061
##
## elapsed time - 0.19 minutes
##
## ########### warning ##########
##
## maximum tree limit reached - results may not be optimal
## - refit with faster learning rate or increase maximum number of trees
## [1] 7
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for NA and using a family of gaussian
## Using 267 observations and 18 predictors
## creating 10 initial models of 50 trees
##
## folds are unstratified
## total mean deviance = 10.9652
## tolerance is fixed at 0.011
## ntrees resid. dev.
## 50 10.9315
## now adding trees...
## 100 10.8772
## 150 10.8351
## 200 10.7913
## 250 10.747
## 300 10.7086
## 350 10.6744
## 400 10.6383
## 450 10.6228
## 500 10.6039
## 550 10.5884
## 600 10.5684
## 650 10.5566
## 700 10.5443
## 750 10.5322
## 800 10.5242
## 850 10.5121
## 900 10.5099
## 950 10.5045
## 1000 10.5029
##
## mean total deviance = 10.965
## mean residual deviance = 8.435
##
## estimated cv deviance = 10.503 ; se = 1.021
##
## training data correlation = 0.631
## cv correlation = 0.211 ; se = 0.052
##
## elapsed time - 0.19 minutes
##
## ########### warning ##########
##
## maximum tree limit reached - results may not be optimal
## - refit with faster learning rate or increase maximum number of trees
## [1] 8
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for NA and using a family of gaussian
## Using 267 observations and 18 predictors
## creating 10 initial models of 50 trees
##
## folds are unstratified
## total mean deviance = 10.7438
## tolerance is fixed at 0.0107
## ntrees resid. dev.
## 50 10.6394
## now adding trees...
## 100 10.5352
## 150 10.4387
## 200 10.3513
## 250 10.2712
## 300 10.2035
## 350 10.14
## 400 10.0871
## 450 10.0398
## 500 9.9957
## 550 9.9518
## 600 9.9172
## 650 9.8844
## 700 9.8561
## 750 9.8327
## 800 9.812
## 850 9.7917
## 900 9.7734
## 950 9.7583
## 1000 9.7459
##
## mean total deviance = 10.744
## mean residual deviance = 8
##
## estimated cv deviance = 9.746 ; se = 0.684
##
## training data correlation = 0.622
## cv correlation = 0.309 ; se = 0.059
##
## elapsed time - 0.19 minutes
##
## ########### warning ##########
##
## maximum tree limit reached - results may not be optimal
## - refit with faster learning rate or increase maximum number of trees
## [1] 9
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for NA and using a family of gaussian
## Using 267 observations and 18 predictors
## creating 10 initial models of 50 trees
##
## folds are unstratified
## total mean deviance = 10.2843
## tolerance is fixed at 0.0103
## ntrees resid. dev.
## 50 10.2681
## now adding trees...
## 100 10.1843
## 150 10.1075
## 200 10.0367
## 250 9.9759
## 300 9.9167
## 350 9.8614
## 400 9.8137
## 450 9.7705
## 500 9.7301
## 550 9.7031
## 600 9.6721
## 650 9.6434
## 700 9.6211
## 750 9.5998
## 800 9.5814
## 850 9.5662
## 900 9.5568
## 950 9.5392
## 1000 9.5296
##
## mean total deviance = 10.284
## mean residual deviance = 7.753
##
## estimated cv deviance = 9.53 ; se = 0.743
##
## training data correlation = 0.631
## cv correlation = 0.296 ; se = 0.07
##
## elapsed time - 0.19 minutes
##
## ########### warning ##########
##
## maximum tree limit reached - results may not be optimal
## - refit with faster learning rate or increase maximum number of trees
## [1] 10
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for NA and using a family of gaussian
## Using 267 observations and 18 predictors
## creating 10 initial models of 50 trees
##
## folds are unstratified
## total mean deviance = 10.4431
## tolerance is fixed at 0.0104
## ntrees resid. dev.
## 50 10.4748
## now adding trees...
## 100 10.3803
## 150 10.2992
## 200 10.2174
## 250 10.1437
## 300 10.081
## 350 10.0145
## 400 9.959
## 450 9.9029
## 500 9.8558
## 550 9.8171
## 600 9.7766
## 650 9.7342
## 700 9.7024
## 750 9.6758
## 800 9.6466
## 850 9.622
## 900 9.5965
## 950 9.5781
## 1000 9.5551
##
## mean total deviance = 10.443
## mean residual deviance = 7.655
##
## estimated cv deviance = 9.555 ; se = 0.898
##
## training data correlation = 0.642
## cv correlation = 0.314 ; se = 0.067
##
## elapsed time - 0.18 minutes
##
## ########### warning ##########
##
## maximum tree limit reached - results may not be optimal
## - refit with faster learning rate or increase maximum number of trees
Let’s see the correlation between the predicted and observed response variable.
#correlation between the predicted and observed grade(G3)
all_cor_grade
## grade_glm grade_gam grade_gbm1 grade_gbm2
## 1 0.2734951 0.2807136 0.2540703 0.2529366
The correlations are all low and do not differ much from one another.
Below, we can see the errors of the various models.
#error of all the models
all_error_grade
## grade_glm grade_gam grade_gbm1 grade_gbm2
## mean abs error 2.624296 2.628967 2.653752 2.651859
## RMSE 3.321220 3.327466 3.372944 3.367398
There appears to be little difference in the errors across the models used.
I chose GBM because it can handle multicollinearity and complex interactions; it can also show response curves and the relative importance of predictors. GAM can likewise show response curves along with their confidence intervals.
SUMMARY OF THE GBM FOR GRADE(G3)
summary(grade_gbm1)
## var rel.inf
## higher higher 26.0188316
## Medu Medu 12.5100743
## alc_use alc_use 9.2399806
## absences absences 7.8489797
## age age 6.8760066
## studytime studytime 6.6032340
## traveltime traveltime 6.1522710
## goout goout 5.7957168
## Fedu Fedu 3.9204703
## address address 3.4998539
## freetime freetime 2.6551947
## famsup famsup 1.7636931
## internet internet 1.7039128
## sex sex 1.6437026
## romantic romantic 1.2318752
## famrel famrel 1.2249596
## activities activities 0.9150867
## Pstatus Pstatus 0.3961565
Here, we can see that the intention to pursue higher education, mother’s education level, alcohol use, and absences seem to be the most important factors affecting students’ performance. This is quite in line with the results I got from GLM and GAM; however, in those models only mother’s education level and the pursuit of higher education seem to be significant factors. The least important factors are also shown in the GBM’s summary of relative importance.
Now, let’s see the corresponding response curves from GAM.
grade_gam <- gam(G3~ s(Medu, k=3) + higher + address + s(studytime, k=3) + goout, data = data_reg, family = "gaussian")
plot(grade_gam, pages=1)
Here, from the GAM smooth curve, we can see that mother’s education level has a positive effect on students’ performance. However, the confidence interval, especially at the lower levels, is quite wide, which indicates considerable uncertainty; perhaps more observations are needed. This will be explored further in the response curves from GBM below.
best.iter1<-gbm.perf(grade_gbm1, plot.it = F, method = "OOB")
par(mfrow=c(2,2))
plot.gbm(grade_gbm1, "Medu", best.iter1)
plot.gbm(grade_gbm1, "higher", best.iter1)
plot.gbm(grade_gbm1, "alc_use", best.iter1)
plot.gbm(grade_gbm1, "absences", best.iter1)
par(mfrow=c(1,1))
As we can see here, the higher the mother’s level of education, the better her children seem to perform. Also, students who intend to pursue higher education seem to perform better. Alcohol use and absences reduce performance and can cause a dramatic drop if they become chronic.
plot(predict.gbm(grade_gbm1, data_reg, best.iter1), data_reg$G3,
main="Observed vs Predicted grade")
lines(lowess(predict.gbm(grade_gbm1, data_reg, best.iter1), data_reg$G3), col="red", lwd=3)
r_grade <-cor.test(predict.gbm(grade_gbm1, data_reg, best.iter1), data_reg$G3)
r2grade <- r_grade$estimate^2
r2grade
## cor
## 0.3260634
legend("topleft", paste("r^2=", round(r2grade,3)))
We can see from the scatterplot that the selected variables account for only about 33% of the variation in students’ grades (r² ≈ 0.326).
For the GLM, the predictors were selected as before, by using stepwise regression, an anova (Chi-squared) test, and the significance tests from the model summary.
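The calls that produced this output are not echoed in the report; a rough reconstruction (the formula is taken from the printed Call, and `stepAIC` from the MASS package is assumed for the stepwise search) might look like:

```r
# Full Poisson GLM for absences (formula taken from the printed Call)
abse_glm_full <- glm(absences ~ sex + age + address + Medu + Fedu + Pstatus +
                       traveltime + studytime + famsup + activities + higher +
                       internet + famrel + romantic + freetime + goout +
                       high_use + G3,
                     data = data_reg, family = "poisson")
summary(abse_glm_full)
# Backward stepwise selection by AIC (stepAIC() is in the MASS package)
abse_glm_step <- stepAIC(abse_glm_full, direction = "backward")
# Sequential analysis-of-deviance (Chi-squared) test
anova(abse_glm_full, test = "Chisq")
```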
##
## Call:
## glm(formula = absences ~ sex + age + address + Medu + Fedu +
## Pstatus + traveltime + studytime + famsup + activities +
## higher + internet + famrel + romantic + freetime + goout +
## high_use + G3, family = "poisson", data = data_reg)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.4872 -1.7699 -0.6181 0.7933 8.5118
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.74725 0.43714 1.709 0.08737 .
## sexM -0.33113 0.05515 -6.004 1.92e-09 ***
## age 0.09829 0.02175 4.520 6.17e-06 ***
## addressU -0.13937 0.06329 -2.202 0.02767 *
## Medu 0.12363 0.03159 3.914 9.07e-05 ***
## Fedu -0.04894 0.02963 -1.652 0.09860 .
## PstatusT -0.43752 0.07029 -6.224 4.85e-10 ***
## traveltime -0.07698 0.03844 -2.003 0.04521 *
## studytime -0.19572 0.03515 -5.568 2.58e-08 ***
## famsupyes 0.06916 0.05329 1.298 0.19433
## activitiesyes 0.13248 0.05151 2.572 0.01011 *
## higheryes -0.17379 0.11194 -1.553 0.12052
## internetyes 0.35447 0.08038 4.410 1.03e-05 ***
## famrel -0.02684 0.02678 -1.002 0.31631
## romanticyes 0.15072 0.05288 2.850 0.00437 **
## freetime -0.08304 0.02707 -3.068 0.00216 **
## goout 0.04036 0.02434 1.658 0.09727 .
## high_useTRUE 0.49462 0.05598 8.836 < 2e-16 ***
## G3 -0.01864 0.00819 -2.276 0.02287 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1917.2 on 381 degrees of freedom
## Residual deviance: 1582.2 on 363 degrees of freedom
## AIC: 2639.8
##
## Number of Fisher Scoring iterations: 5
## Start: AIC=2639.77
## absences ~ sex + age + address + Medu + Fedu + Pstatus + traveltime +
## studytime + famsup + activities + higher + internet + famrel +
## romantic + freetime + goout + high_use + G3
##
## Df Deviance AIC
## - famrel 1 1583.2 2638.8
## - famsup 1 1583.8 2639.5
## <none> 1582.2 2639.8
## - higher 1 1584.5 2640.1
## - Fedu 1 1584.9 2640.5
## - goout 1 1584.9 2640.5
## - traveltime 1 1586.2 2641.9
## - address 1 1586.9 2642.5
## - G3 1 1587.3 2642.9
## - activities 1 1588.8 2644.4
## - romantic 1 1590.2 2645.8
## - freetime 1 1591.5 2647.2
## - Medu 1 1597.6 2653.2
## - age 1 1602.4 2658.0
## - internet 1 1603.1 2658.7
## - studytime 1 1614.3 2669.9
## - Pstatus 1 1617.6 2673.2
## - sex 1 1618.5 2674.1
## - high_use 1 1658.0 2713.6
##
## Step: AIC=2638.77
## absences ~ sex + age + address + Medu + Fedu + Pstatus + traveltime +
## studytime + famsup + activities + higher + internet + romantic +
## freetime + goout + high_use + G3
##
## Df Deviance AIC
## - famsup 1 1584.9 2638.5
## <none> 1583.2 2638.8
## - higher 1 1585.5 2639.2
## - Fedu 1 1585.8 2639.4
## - goout 1 1585.9 2639.5
## - traveltime 1 1587.2 2640.8
## - address 1 1587.8 2641.5
## - G3 1 1588.5 2642.1
## - activities 1 1589.5 2643.2
## - romantic 1 1591.4 2645.0
## - freetime 1 1593.5 2647.1
## - Medu 1 1598.7 2652.3
## - age 1 1602.8 2656.4
## - internet 1 1603.8 2657.5
## - studytime 1 1615.2 2668.8
## - Pstatus 1 1618.7 2672.3
## - sex 1 1619.9 2673.5
## - high_use 1 1663.2 2716.8
##
## Step: AIC=2638.49
## absences ~ sex + age + address + Medu + Fedu + Pstatus + traveltime +
## studytime + activities + higher + internet + romantic + freetime +
## goout + high_use + G3
##
## Df Deviance AIC
## <none> 1584.9 2638.5
## - higher 1 1587.0 2638.6
## - Fedu 1 1587.1 2638.7
## - goout 1 1587.6 2639.2
## - traveltime 1 1588.6 2640.2
## - address 1 1589.9 2641.5
## - G3 1 1590.7 2642.3
## - activities 1 1590.8 2642.4
## - romantic 1 1593.2 2644.8
## - freetime 1 1595.1 2646.7
## - Medu 1 1601.0 2652.7
## - age 1 1603.7 2655.3
## - internet 1 1606.5 2658.1
## - studytime 1 1615.5 2667.1
## - Pstatus 1 1619.8 2671.4
## - sex 1 1623.8 2675.4
## - high_use 1 1664.6 2716.3
##
## Call: glm(formula = absences ~ sex + age + address + Medu + Fedu +
## Pstatus + traveltime + studytime + activities + higher +
## internet + romantic + freetime + goout + high_use + G3, family = "poisson",
## data = data_reg)
##
## Coefficients:
## (Intercept) sexM age addressU Medu
## 0.73842 -0.34066 0.09383 -0.14284 0.12595
## Fedu PstatusT traveltime studytime activitiesyes
## -0.04394 -0.43378 -0.07353 -0.18947 0.12497
## higheryes internetyes romanticyes freetime goout
## -0.16371 0.35877 0.15299 -0.08587 0.03984
## high_useTRUE G3
## 0.50207 -0.01980
##
## Degrees of Freedom: 381 Total (i.e. Null); 365 Residual
## Null Deviance: 1917
## Residual Deviance: 1585 AIC: 2638
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: absences
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 381 1917.2
## sex 1 6.559 380 1910.7 0.0104342 *
## age 1 39.964 379 1870.7 2.587e-10 ***
## address 1 0.844 378 1869.8 0.3582118
## Medu 1 30.346 377 1839.5 3.615e-08 ***
## Fedu 1 1.355 376 1838.1 0.2444438
## Pstatus 1 37.928 375 1800.2 7.340e-10 ***
## traveltime 1 0.019 374 1800.2 0.8917349
## studytime 1 52.787 373 1747.4 3.717e-13 ***
## famsup 1 2.517 372 1744.9 0.1126040
## activities 1 3.515 371 1741.4 0.0608124 .
## higher 1 11.233 370 1730.1 0.0008034 ***
## internet 1 27.982 369 1702.2 1.224e-07 ***
## famrel 1 7.405 368 1694.8 0.0065034 **
## romantic 1 3.782 367 1691.0 0.0518000 .
## freetime 1 1.799 366 1689.2 0.1798908
## goout 1 24.357 365 1664.8 8.003e-07 ***
## high_use 1 77.516 364 1587.3 < 2.2e-16 ***
## G3 1 5.148 363 1582.2 0.0232754 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Final model
abse_glm<-glm(absences ~ sex + age + Medu +
Pstatus + studytime + higher +
internet + goout + alc_use + G3
,data=data_reg,family ="poisson")
summary(abse_glm)
##
## Call:
## glm(formula = absences ~ sex + age + Medu + Pstatus + studytime +
## higher + internet + goout + alc_use + G3, family = "poisson",
## data = data_reg)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.1134 -1.8059 -0.6326 0.8032 9.8400
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.170583 0.410599 0.415 0.67781
## sexM -0.406371 0.054637 -7.438 1.02e-13 ***
## age 0.095102 0.020967 4.536 5.74e-06 ***
## Medu 0.116579 0.024475 4.763 1.91e-06 ***
## PstatusT -0.458253 0.069217 -6.621 3.58e-11 ***
## studytime -0.156525 0.033700 -4.645 3.41e-06 ***
## higheryes -0.220171 0.108890 -2.022 0.04318 *
## internetyes 0.349942 0.078836 4.439 9.04e-06 ***
## goout 0.007283 0.023303 0.313 0.75463
## alc_use 0.220332 0.026020 8.468 < 2e-16 ***
## G3 -0.022122 0.008040 -2.751 0.00593 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1917.2 on 381 degrees of freedom
## Residual deviance: 1622.0 on 371 degrees of freedom
## AIC: 2663.6
##
## Number of Fisher Scoring iterations: 5
anova(abse_glm, test="Chisq")
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: absences
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 381 1917.2
## sex 1 6.559 380 1910.7 0.010434 *
## age 1 39.964 379 1870.7 2.587e-10 ***
## Medu 1 28.611 378 1842.1 8.848e-08 ***
## Pstatus 1 37.104 377 1805.0 1.120e-09 ***
## studytime 1 50.651 376 1754.3 1.103e-12 ***
## higher 1 10.643 375 1743.7 0.001105 **
## internet 1 26.691 374 1717.0 2.387e-07 ***
## goout 1 15.196 373 1701.8 9.692e-05 ***
## alc_use 1 72.308 372 1629.5 < 2.2e-16 ***
## G3 1 7.512 371 1622.0 0.006130 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The factors that might be causing student absences are shown above.
## [1] 1
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for NA and using a family of poisson
## Using 267 observations and 17 predictors
## creating 10 initial models of 50 trees
##
## folds are unstratified
## total mean deviance = 4.9355
## tolerance is fixed at 0.0049
## ntrees resid. dev.
## 50 4.9413
## now adding trees...
## 100 4.9032
## 150 4.873
## 200 4.8452
## 250 4.8221
## 300 4.802
## 350 4.783
## 400 4.7666
## 450 4.7533
## 500 4.7401
## 550 4.7287
## 600 4.7177
## 650 4.7097
## 700 4.7022
## 750 4.6947
## 800 4.6891
## 850 4.6851
## 900 4.68
## 950 4.6774
## 1000 4.6734
##
## mean total deviance = 4.935
## mean residual deviance = 3.824
##
## estimated cv deviance = 4.673 ; se = 0.511
##
## training data correlation = 0.601
## cv correlation = 0.224 ; se = 0.06
##
## elapsed time - 0.19 minutes
##
## ########### warning ##########
##
## maximum tree limit reached - results may not be optimal
## - refit with faster learning rate or increase maximum number of trees
## [1] 2
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for NA and using a family of poisson
## Using 267 observations and 17 predictors
## creating 10 initial models of 50 trees
##
## folds are unstratified
## total mean deviance = 5.0147
## tolerance is fixed at 0.005
## ntrees resid. dev.
## 50 4.9987
## now adding trees...
## 100 4.9687
## 150 4.945
## 200 4.9232
## 250 4.9021
## 300 4.8854
## 350 4.8693
## 400 4.8555
## 450 4.8427
## 500 4.8333
## 550 4.8247
## 600 4.8165
## 650 4.8098
## 700 4.8033
## 750 4.7988
## 800 4.7954
## 850 4.7938
## 900 4.7959
## 950 4.7934
## 1000 4.7925
##
## mean total deviance = 5.015
## mean residual deviance = 3.952
##
## estimated cv deviance = 4.793 ; se = 0.509
##
## training data correlation = 0.582
## cv correlation = 0.222 ; se = 0.062
##
## elapsed time - 0.18 minutes
##
## ########### warning ##########
##
## maximum tree limit reached - results may not be optimal
## - refit with faster learning rate or increase maximum number of trees
## [1] 3
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for NA and using a family of poisson
## Using 267 observations and 17 predictors
## creating 10 initial models of 50 trees
##
## folds are unstratified
## total mean deviance = 4.3837
## tolerance is fixed at 0.0044
## ntrees resid. dev.
## 50 4.4015
## now adding trees...
## 100 4.3906
## 150 4.3814
## 200 4.3722
## 250 4.3653
## 300 4.3581
## 350 4.3563
## 400 4.3525
## 450 4.3515
## 500 4.3513
## 550 4.3513
## 600 4.3508
## 650 4.3537
## 700 4.3533
## 750 4.3564
## 800 4.3595
## 850 4.3637
## 900 4.3661
## 950 4.3719
## 1000 4.3769
##
## mean total deviance = 4.384
## mean residual deviance = 3.78
##
## estimated cv deviance = 4.351 ; se = 0.381
##
## training data correlation = 0.544
## cv correlation = 0.127 ; se = 0.059
##
## elapsed time - 0.18 minutes
##
## ########### warning ##########
##
## maximum tree limit reached - results may not be optimal
## - refit with faster learning rate or increase maximum number of trees
## [1] 4
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for NA and using a family of poisson
## Using 267 observations and 17 predictors
## creating 10 initial models of 50 trees
##
## folds are unstratified
## total mean deviance = 4.7289
## tolerance is fixed at 0.0047
## ntrees resid. dev.
## 50 4.8103
## now adding trees...
## 100 4.791
## 150 4.7746
## 200 4.761
## 250 4.7493
## 300 4.7397
## 350 4.7286
## 400 4.7196
## 450 4.7128
## 500 4.7076
## 550 4.7048
## 600 4.7041
## 650 4.7038
## 700 4.7041
## 750 4.7056
## 800 4.7081
## 850 4.7104
## 900 4.7131
## 950 4.7153
## 1000 4.7183
##
## mean total deviance = 4.729
## mean residual deviance = 3.999
##
## estimated cv deviance = 4.704 ; se = 0.629
##
## training data correlation = 0.541
## cv correlation = 0.184 ; se = 0.065
##
## elapsed time - 0.18 minutes
##
## ########### warning ##########
##
## maximum tree limit reached - results may not be optimal
## - refit with faster learning rate or increase maximum number of trees
## [1] 5
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for NA and using a family of poisson
## Using 267 observations and 17 predictors
## creating 10 initial models of 50 trees
##
## folds are unstratified
## total mean deviance = 5.1873
## tolerance is fixed at 0.0052
## ntrees resid. dev.
## 50 5.234
## now adding trees...
## 100 5.2184
## 150 5.2026
## 200 5.1905
## 250 5.1767
## 300 5.1667
## 350 5.1604
## 400 5.1525
## 450 5.1452
## 500 5.1406
## 550 5.1342
## 600 5.134
## 650 5.1331
## 700 5.1318
## 750 5.1311
## 800 5.1298
## 850 5.1297
## 900 5.1296
## 950 5.1304
## 1000 5.1322
##
## mean total deviance = 5.187
## mean residual deviance = 4.17
##
## estimated cv deviance = 5.13 ; se = 0.701
##
## training data correlation = 0.565
## cv correlation = 0.159 ; se = 0.046
##
## elapsed time - 0.18 minutes
##
## ########### warning ##########
##
## maximum tree limit reached - results may not be optimal
## - refit with faster learning rate or increase maximum number of trees
## [1] 6
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for NA and using a family of poisson
## Using 267 observations and 17 predictors
## creating 10 initial models of 50 trees
##
## folds are unstratified
## total mean deviance = 4.7934
## tolerance is fixed at 0.0048
## ntrees resid. dev.
## 50 4.8443
## now adding trees...
## 100 4.8208
## 150 4.7992
## 200 4.7822
## 250 4.7659
## 300 4.7547
## 350 4.7442
## 400 4.7319
## 450 4.7236
## 500 4.7189
## 550 4.7173
## 600 4.7146
## 650 4.7129
## 700 4.7136
## 750 4.7139
## 800 4.7129
## 850 4.7143
## 900 4.7162
## 950 4.7177
## 1000 4.7179
##
## mean total deviance = 4.793
## mean residual deviance = 3.956
##
## estimated cv deviance = 4.713 ; se = 1.161
##
## training data correlation = 0.546
## cv correlation = 0.214 ; se = 0.05
##
## elapsed time - 0.19 minutes
##
## ########### warning ##########
##
## maximum tree limit reached - results may not be optimal
## - refit with faster learning rate or increase maximum number of trees
## [1] 7
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for NA and using a family of poisson
## Using 267 observations and 17 predictors
## creating 10 initial models of 50 trees
##
## folds are unstratified
## total mean deviance = 5.7434
## tolerance is fixed at 0.0057
## ntrees resid. dev.
## 50 5.7636
## now adding trees...
## 100 5.7322
## 150 5.7054
## 200 5.6804
## 250 5.6583
## 300 5.6374
## 350 5.6188
## 400 5.6024
## 450 5.5879
## 500 5.5754
## 550 5.5636
## 600 5.5519
## 650 5.5401
## 700 5.5291
## 750 5.517
## 800 5.5078
## 850 5.4974
## 900 5.4903
## 950 5.4844
## 1000 5.4767
##
## mean total deviance = 5.743
## mean residual deviance = 4.447
##
## estimated cv deviance = 5.477 ; se = 0.931
##
## training data correlation = 0.617
## cv correlation = 0.269 ; se = 0.052
##
## elapsed time - 0.19 minutes
##
## ########### warning ##########
##
## maximum tree limit reached - results may not be optimal
## - refit with faster learning rate or increase maximum number of trees
## [1] 8
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for NA and using a family of poisson
## Using 267 observations and 17 predictors
## creating 10 initial models of 50 trees
##
## folds are unstratified
## total mean deviance = 5.3767
## tolerance is fixed at 0.0054
## ntrees resid. dev.
## 50 5.3726
## now adding trees...
## 100 5.3469
## 150 5.3292
## 200 5.3137
## 250 5.2983
## 300 5.2833
## 350 5.2716
## 400 5.2644
## 450 5.2578
## 500 5.2513
## 550 5.2444
## 600 5.2397
## 650 5.2338
## 700 5.2288
## 750 5.2269
## 800 5.2268
## 850 5.2224
## 900 5.22
## 950 5.2161
## 1000 5.215
##
## mean total deviance = 5.377
## mean residual deviance = 4.318
##
## estimated cv deviance = 5.215 ; se = 0.803
##
## training data correlation = 0.569
## cv correlation = 0.226 ; se = 0.058
##
## elapsed time - 0.18 minutes
##
## ########### warning ##########
##
## maximum tree limit reached - results may not be optimal
## - refit with faster learning rate or increase maximum number of trees
## [1] 9
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for NA and using a family of poisson
## Using 267 observations and 17 predictors
## creating 10 initial models of 50 trees
##
## folds are unstratified
## total mean deviance = 5.1572
## tolerance is fixed at 0.0052
## ntrees resid. dev.
## 50 5.203
## now adding trees...
## 100 5.168
## 150 5.1386
## 200 5.1103
## 250 5.0847
## 300 5.0616
## 350 5.0431
## 400 5.0253
## 450 5.0099
## 500 4.9984
## 550 4.9861
## 600 4.9791
## 650 4.9706
## 700 4.9634
## 750 4.9575
## 800 4.9531
## 850 4.9525
## 900 4.9505
## 950 4.9495
## 1000 4.9472
##
## mean total deviance = 5.157
## mean residual deviance = 4.029
##
## estimated cv deviance = 4.947 ; se = 0.676
##
## training data correlation = 0.591
## cv correlation = 0.185 ; se = 0.062
##
## elapsed time - 0.19 minutes
##
## ########### warning ##########
##
## maximum tree limit reached - results may not be optimal
## - refit with faster learning rate or increase maximum number of trees
## [1] 10
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for NA and using a family of poisson
## Using 267 observations and 17 predictors
## creating 10 initial models of 50 trees
##
## folds are unstratified
## total mean deviance = 4.5998
## tolerance is fixed at 0.0046
## ntrees resid. dev.
## 50 4.6305
## now adding trees...
## 100 4.6161
## 150 4.6044
## 200 4.5934
## 250 4.5815
## 300 4.5711
## 350 4.5646
## 400 4.5589
## 450 4.5523
## 500 4.5473
## 550 4.5426
## 600 4.538
## 650 4.5342
## 700 4.5311
## 750 4.5284
## 800 4.5256
## 850 4.5266
## 900 4.5264
## 950 4.5258
## 1000 4.5246
##
## mean total deviance = 4.6
## mean residual deviance = 3.662
##
## estimated cv deviance = 4.525 ; se = 0.464
##
## training data correlation = 0.597
## cv correlation = 0.156 ; se = 0.058
##
## elapsed time - 0.19 minutes
##
## ########### warning ##########
##
## maximum tree limit reached - results may not be optimal
## - refit with faster learning rate or increase maximum number of trees
#Correlation between the predicted and observed values for all the models used.
all_cor_abse
## abse_glm abse_gam abse_gbm1 abse_gbm2
## 1 0.2246732 0.2231245 0.1993083 0.1881407
#Let's see the prediction errors for the "absences" response variable.
all_error_abse
## abse_glm abse_gam abse_gbm1 abse_gbm2
## mean abs error 4.001197 3.968672 4.023673 3.951446
## RMSE 5.942599 5.966504 6.097966 6.048195
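The two error measures in the table above can be computed directly from the observed and predicted values. A minimal sketch with hypothetical vectors (not the assignment's data):

```r
# Illustrative observed and predicted absence counts
obs  <- c(2, 5, 0, 7)
pred <- c(3, 4, 1, 9)

# Mean absolute error: average magnitude of the prediction errors
mae  <- mean(abs(obs - pred))

# Root mean squared error: penalises large errors more heavily
rmse <- sqrt(mean((obs - pred)^2))

c(MAE = mae, RMSE = rmse)
```

RMSE is always at least as large as MAE, and the gap between them grows with the spread of the errors, which is why both are reported side by side above.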
## var rel.inf
## alc_use alc_use 23.2721674
## age age 14.0840830
## Medu Medu 10.2271230
## Pstatus Pstatus 8.4032585
## romantic romantic 8.2467702
## freetime freetime 7.7315031
## goout goout 4.9877363
## sex sex 4.5833422
## Fedu Fedu 4.3660366
## studytime studytime 4.1481797
## famrel famrel 3.2338396
## activities activities 2.1010025
## famsup famsup 1.4164631
## address address 1.1068829
## higher higher 0.9220471
## internet internet 0.8971392
## traveltime traveltime 0.2724257
I chose GBM because it can handle multicollinearity and complex interactions; it can also display response curves and the relative importance of the predictors.
Alcohol use, age, mother's education, parents' cohabitation status, free time and romantic relationships appear to have the strongest effect on absences, while travel time and address seem to matter least.
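The relative-influence table above comes from summarising a fitted boosted model. A minimal sketch of the mechanism (using `mtcars` as a stand-in for the student data; `toy_gbm` is an illustrative name, not the assignment's model):

```r
library(gbm)

set.seed(1)
# Fit a small boosted regression tree model on toy data
toy_gbm <- gbm(mpg ~ ., data = mtcars, distribution = "gaussian",
               n.trees = 100, interaction.depth = 2, shrinkage = 0.05,
               n.minobsinnode = 5, verbose = FALSE)

# summary() returns the predictors ranked by relative influence,
# normalised so the values sum to 100
summary(toy_gbm, plotit = FALSE)
```

The same call on `abse_gbm1` produces the ranking shown above, where `alc_use` carries roughly 23% of the total influence.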
#Now, let's look at the response curves of this model,
#using the full dataset
abse_gam <- gam(absences~ sex + s(age, k=3) +s(Medu, k=3) +
Pstatus+ s(studytime, k=3) + higher +
internet + goout + s(alc_use, k=3) + s(G3, k=3), data = alco_data, family = "poisson")
plot(abse_gam, pages=1)
Teenage students are more likely to be absent, although there seems to be too little data for students above 20, as the confidence intervals there are wide. The likelihood of absence decreases as the mother's level of education increases; study time also reduces it, while alcohol use increases it. How the final grade affects absences remains unclear.
I will explore this further with the GBM response curves.
best.iter1<-gbm.perf(abse_gbm1, plot.it = F, method = "OOB")
par(mfrow=c(2,3))
plot.gbm(abse_gbm1, "alc_use", best.iter1)
plot.gbm(abse_gbm1, "age", best.iter1)
plot.gbm(abse_gbm1, "Medu", best.iter1)
plot.gbm(abse_gbm1, "Pstatus", best.iter1)
plot.gbm(abse_gbm1, "freetime", best.iter1)
plot.gbm(abse_gbm1, "romantic", best.iter1)
par(mfrow=c(1,1))
Alcohol use, age and mother's education seem to increase the tendency to be absent, while free time does the opposite. This is quite surprising, as I had expected the exact opposite. Also, students whose parents live apart are more likely to be absent than those whose parents live together. Likewise, students in a romantic relationship are more likely to be absent than those who are not.
plot(predict.gbm(abse_gbm1, data_reg, best.iter1), data_reg$absences,
main="Observed vs Predicted absences")
lines(lowess(predict.gbm(abse_gbm1, data_reg, best.iter1), data_reg$absences), col="red", lwd=3)
r_abse <-cor.test(predict.gbm(abse_gbm1, data_reg, best.iter1), data_reg$absences)
r2abse <- r_abse$estimate^2
r2abse
## cor
## 0.2496773
legend("topleft", paste("r^2=", round(r2abse,3)))
We can see from the scatterplot that the selected variables account for only about 25% (the coefficient of determination, r² ≈ 0.25) of the variation in students' absences.
Similar steps were repeated here for the high alcohol use response. However, in evaluating the models I used the Area Under the Curve (AUC), odds ratios and the confusion/classification matrix instead.
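A hedged sketch of these three evaluation tools on a toy logistic model (`mtcars`' transmission type predicted from weight; `toy_glm` and the threshold of 0.5 are illustrative choices, not taken from the assignment):

```r
library(caTools)

# Toy binary classifier: transmission type (am) from car weight (wt)
toy_glm <- glm(am ~ wt, data = mtcars, family = "binomial")
pred <- predict(toy_glm, type = "response")

# Confusion/classification matrix at a 0.5 probability cut-off
table(observed = mtcars$am, predicted = pred > 0.5)

# Odds ratios: exponentiated logistic regression coefficients
exp(coef(toy_glm))

# AUC via caTools (a package already loaded for this analysis)
colAUC(pred, mtcars$am)
```

The confusion matrix shows classification accuracy at a fixed threshold, the odds ratios quantify each predictor's multiplicative effect on the odds of the outcome, and the AUC summarises discrimination across all thresholds.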
##
## Call:
## glm(formula = high_use ~ sex + age + address + Medu + Fedu +
## Pstatus + traveltime + studytime + famsup + activities +
## higher + internet + famrel + romantic + freetime + goout +
## absences, family = "binomial", data = data_reg)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9022 -0.7522 -0.4473 0.6526 2.6517
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.759454 2.385066 -1.576 0.114969
## sexM 0.860723 0.291187 2.956 0.003117 **
## age 0.080169 0.120926 0.663 0.507357
## addressU -0.601436 0.350235 -1.717 0.085936 .
## Medu -0.091134 0.169343 -0.538 0.590464
## Fedu 0.150634 0.164444 0.916 0.359656
## PstatusT 0.026078 0.456176 0.057 0.954413
## traveltime 0.281860 0.201586 1.398 0.162050
## studytime -0.331499 0.180785 -1.834 0.066704 .
## famsupyes 0.005161 0.282845 0.018 0.985442
## activitiesyes -0.477285 0.276610 -1.725 0.084441 .
## higheryes 0.146713 0.595949 0.246 0.805539
## internetyes 0.190045 0.393884 0.482 0.629458
## famrel -0.419249 0.148466 -2.824 0.004745 **
## romanticyes -0.344881 0.297108 -1.161 0.245725
## freetime 0.154201 0.146426 1.053 0.292295
## goout 0.751213 0.133447 5.629 1.81e-08 ***
## absences 0.079062 0.022936 3.447 0.000567 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 360.78 on 364 degrees of freedom
## AIC: 396.78
##
## Number of Fisher Scoring iterations: 5
## Start: AIC=396.78
## high_use ~ sex + age + address + Medu + Fedu + Pstatus + traveltime +
## studytime + famsup + activities + higher + internet + famrel +
## romantic + freetime + goout + absences
##
## Df Deviance AIC
## - famsup 1 360.78 394.78
## - Pstatus 1 360.78 394.78
## - higher 1 360.84 394.84
## - internet 1 361.01 395.01
## - Medu 1 361.07 395.07
## - age 1 361.22 395.22
## - Fedu 1 361.62 395.62
## - freetime 1 361.89 395.89
## - romantic 1 362.15 396.15
## - traveltime 1 362.73 396.73
## <none> 360.78 396.78
## - address 1 363.69 397.69
## - activities 1 363.79 397.79
## - studytime 1 364.29 398.29
## - famrel 1 368.85 402.85
## - sex 1 369.73 403.73
## - absences 1 372.58 406.58
## - goout 1 397.21 431.21
##
## Step: AIC=394.78
## high_use ~ sex + age + address + Medu + Fedu + Pstatus + traveltime +
## studytime + activities + higher + internet + famrel + romantic +
## freetime + goout + absences
##
## Df Deviance AIC
## - Pstatus 1 360.78 392.78
## - higher 1 360.84 392.84
## - internet 1 361.02 393.02
## - Medu 1 361.07 393.07
## - age 1 361.22 393.22
## - Fedu 1 361.64 393.64
## - freetime 1 361.90 393.90
## - romantic 1 362.15 394.15
## - traveltime 1 362.73 394.73
## <none> 360.78 394.78
## - address 1 363.70 395.70
## - activities 1 363.79 395.79
## - studytime 1 364.33 396.33
## + famsup 1 360.78 396.78
## - famrel 1 368.86 400.86
## - sex 1 369.83 401.83
## - absences 1 372.59 404.59
## - goout 1 397.21 429.21
##
## Step: AIC=392.78
## high_use ~ sex + age + address + Medu + Fedu + traveltime + studytime +
## activities + higher + internet + famrel + romantic + freetime +
## goout + absences
##
## Df Deviance AIC
## - higher 1 360.84 390.84
## - internet 1 361.03 391.03
## - Medu 1 361.08 391.08
## - age 1 361.23 391.23
## - Fedu 1 361.64 391.64
## - freetime 1 361.91 391.91
## - romantic 1 362.16 392.16
## - traveltime 1 362.73 392.73
## <none> 360.78 392.78
## - address 1 363.71 393.71
## - activities 1 363.82 393.82
## - studytime 1 364.34 394.34
## + Pstatus 1 360.78 394.78
## + famsup 1 360.78 394.78
## - famrel 1 368.86 398.86
## - sex 1 369.84 399.84
## - absences 1 372.70 402.70
## - goout 1 397.23 427.23
##
## Step: AIC=390.84
## high_use ~ sex + age + address + Medu + Fedu + traveltime + studytime +
## activities + internet + famrel + romantic + freetime + goout +
## absences
##
## Df Deviance AIC
## - internet 1 361.07 389.07
## - Medu 1 361.13 389.13
## - age 1 361.24 389.24
## - Fedu 1 361.74 389.74
## - freetime 1 361.99 389.99
## - romantic 1 362.38 390.38
## - traveltime 1 362.79 390.79
## <none> 360.84 390.84
## - address 1 363.74 391.74
## - activities 1 363.82 391.82
## - studytime 1 364.34 392.34
## + higher 1 360.78 392.78
## + Pstatus 1 360.84 392.84
## + famsup 1 360.84 392.84
## - famrel 1 368.90 396.90
## - sex 1 369.92 397.92
## - absences 1 372.71 400.71
## - goout 1 397.25 425.25
##
## Step: AIC=389.07
## high_use ~ sex + age + address + Medu + Fedu + traveltime + studytime +
## activities + famrel + romantic + freetime + goout + absences
##
## Df Deviance AIC
## - Medu 1 361.31 387.31
## - age 1 361.47 387.47
## - Fedu 1 361.98 387.98
## - freetime 1 362.31 388.31
## - romantic 1 362.55 388.55
## - traveltime 1 363.01 389.01
## <none> 361.07 389.07
## - address 1 363.76 389.76
## - activities 1 363.96 389.96
## - studytime 1 364.51 390.51
## + internet 1 360.84 390.84
## + higher 1 361.03 391.03
## + Pstatus 1 361.06 391.06
## + famsup 1 361.07 391.07
## - famrel 1 369.07 395.07
## - sex 1 370.27 396.27
## - absences 1 373.49 399.49
## - goout 1 397.68 423.68
##
## Step: AIC=387.31
## high_use ~ sex + age + address + Fedu + traveltime + studytime +
## activities + famrel + romantic + freetime + goout + absences
##
## Df Deviance AIC
## - age 1 361.74 385.74
## - Fedu 1 362.01 386.01
## - freetime 1 362.51 386.51
## - romantic 1 362.81 386.81
## <none> 361.31 387.31
## - traveltime 1 363.38 387.38
## - address 1 364.12 388.12
## - activities 1 364.28 388.28
## - studytime 1 364.90 388.90
## + Medu 1 361.07 389.07
## + internet 1 361.13 389.13
## + higher 1 361.28 389.28
## + Pstatus 1 361.29 389.29
## + famsup 1 361.31 389.31
## - famrel 1 369.32 393.32
## - sex 1 370.32 394.32
## - absences 1 373.51 397.51
## - goout 1 397.77 421.77
##
## Step: AIC=385.74
## high_use ~ sex + address + Fedu + traveltime + studytime + activities +
## famrel + romantic + freetime + goout + absences
##
## Df Deviance AIC
## - Fedu 1 362.33 384.33
## - freetime 1 362.85 384.85
## - romantic 1 363.04 385.04
## <none> 361.74 385.74
## - traveltime 1 363.88 385.88
## - activities 1 364.82 386.82
## - address 1 364.95 386.95
## + age 1 361.31 387.31
## - studytime 1 365.32 387.32
## + Medu 1 361.47 387.47
## + internet 1 361.55 387.55
## + Pstatus 1 361.70 387.70
## + higher 1 361.73 387.73
## + famsup 1 361.74 387.74
## - famrel 1 369.56 391.56
## - sex 1 370.89 392.89
## - absences 1 374.53 396.53
## - goout 1 400.71 422.71
##
## Step: AIC=384.33
## high_use ~ sex + address + traveltime + studytime + activities +
## famrel + romantic + freetime + goout + absences
##
## Df Deviance AIC
## - freetime 1 363.35 383.35
## - romantic 1 363.62 383.62
## - traveltime 1 364.17 384.17
## <none> 362.33 384.33
## - activities 1 365.16 385.16
## - address 1 365.44 385.44
## + Fedu 1 361.74 385.74
## - studytime 1 365.86 385.86
## + age 1 362.01 386.01
## + internet 1 362.09 386.09
## + higher 1 362.29 386.29
## + famsup 1 362.32 386.32
## + Medu 1 362.32 386.32
## + Pstatus 1 362.32 386.32
## - famrel 1 370.21 390.21
## - sex 1 371.74 391.74
## - absences 1 375.32 395.32
## - goout 1 401.89 421.89
##
## Step: AIC=383.35
## high_use ~ sex + address + traveltime + studytime + activities +
## famrel + romantic + goout + absences
##
## Df Deviance AIC
## - romantic 1 364.45 382.45
## - traveltime 1 365.07 383.07
## <none> 363.35 383.35
## - activities 1 366.01 384.01
## + freetime 1 362.33 384.33
## - address 1 366.39 384.39
## + Fedu 1 362.85 384.85
## + internet 1 363.02 385.02
## + age 1 363.09 385.09
## - studytime 1 367.13 385.13
## + higher 1 363.30 385.30
## + famsup 1 363.32 385.32
## + Pstatus 1 363.32 385.32
## + Medu 1 363.34 385.34
## - famrel 1 370.64 388.64
## - sex 1 373.74 391.74
## - absences 1 375.96 393.96
## - goout 1 409.84 427.84
##
## Step: AIC=382.45
## high_use ~ sex + address + traveltime + studytime + activities +
## famrel + goout + absences
##
## Df Deviance AIC
## - traveltime 1 366.16 382.16
## <none> 364.45 382.45
## - activities 1 367.21 383.21
## + romantic 1 363.35 383.35
## - address 1 367.42 383.42
## + freetime 1 363.62 383.62
## + Fedu 1 363.95 383.95
## + internet 1 364.19 384.19
## + higher 1 364.27 384.27
## + age 1 364.31 384.31
## - studytime 1 368.37 384.37
## + Pstatus 1 364.41 384.41
## + famsup 1 364.43 384.43
## + Medu 1 364.44 384.44
## - famrel 1 371.46 387.46
## - sex 1 375.05 391.05
## - absences 1 376.50 392.50
## - goout 1 410.69 426.69
##
## Step: AIC=382.16
## high_use ~ sex + address + studytime + activities + famrel +
## goout + absences
##
## Df Deviance AIC
## <none> 366.16 382.16
## + traveltime 1 364.45 382.45
## - activities 1 368.97 382.97
## + romantic 1 365.07 383.07
## + freetime 1 365.45 383.45
## + Fedu 1 365.92 383.92
## + internet 1 365.95 383.95
## + age 1 365.95 383.95
## + higher 1 366.02 384.02
## + Pstatus 1 366.13 384.13
## + Medu 1 366.14 384.14
## + famsup 1 366.15 384.15
## - studytime 1 370.67 384.67
## - address 1 371.58 385.58
## - famrel 1 373.02 387.02
## - sex 1 376.96 390.96
## - absences 1 377.98 391.98
## - goout 1 413.90 427.90
##
## Call: glm(formula = high_use ~ sex + address + studytime + activities +
## famrel + goout + absences, family = "binomial", data = data_reg)
##
## Coefficients:
## (Intercept) sexM addressU studytime activitiesyes
## -1.35960 0.90337 -0.73174 -0.36291 -0.44562
## famrel goout absences
## -0.37614 0.80515 0.07543
##
## Degrees of Freedom: 381 Total (i.e. Null); 374 Residual
## Null Deviance: 465.7
## Residual Deviance: 366.2 AIC: 382.2
Final model (GLM) for high alcohol use:
halc_glm <- glm(high_use ~ sex + studytime + goout + absences,
                data = data_reg, family = "binomial")
anova(halc_glm, test = "Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: high_use
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 381 465.68
## sex 1 14.732 380 450.95 0.0001239 ***
## studytime 1 10.242 379 440.71 0.0013731 **
## goout 1 46.759 378 393.95 8.027e-12 ***
## absences 1 12.641 377 381.31 0.0003775 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] 1
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for NA and using a family of bernoulli
## Using 267 observations and 17 predictors
## creating 10 initial models of 50 trees
##
## folds are stratified by prevalence
## total mean deviance = 1.2211
## tolerance is fixed at 0.0012
## ntrees resid. dev.
## 50 1.1994
## now adding trees...
## 100 1.1801
## 150 1.1631
## 200 1.1477
## 250 1.134
## 300 1.1209
## 350 1.1097
## 400 1.0984
## 450 1.0886
## 500 1.0794
## 550 1.0707
## 600 1.063
## 650 1.0561
## 700 1.0494
## 750 1.0433
## 800 1.0376
## 850 1.0326
## 900 1.0279
## 950 1.0235
## 1000 1.0196
## 1050 1.0156
## 1100 1.012
## 1150 1.0085
## 1200 1.0055
## 1250 1.0027
## 1300 1.0004
## 1350 0.9984
## 1400 0.9966
## 1450 0.9942
## 1500 0.9928
## 1550 0.9917
## 1600 0.9905
## 1650 0.9896
## 1700 0.989
## 1750 0.988
## 1800 0.9874
## 1850 0.9868
## 1900 0.9868
## 1950 0.9865
## 2000 0.9859
## 2050 0.9855
## 2100 0.9856
## 2150 0.9859
## 2200 0.9861
## 2250 0.9863
## 2300 0.9864
## 2350 0.9869
## 2400 0.9873
## 2450 0.988
## 2500 0.9885
## 2550 0.9895
##
## mean total deviance = 1.221
## mean residual deviance = 0.768
##
## estimated cv deviance = 0.985 ; se = 0.043
##
## training data correlation = 0.718
## cv correlation = 0.507 ; se = 0.046
##
## training data AUC score = 0.924
## cv AUC score = 0.74 ; se = 0.035
##
## elapsed time - 0.45 minutes
## [1] 2
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for NA and using a family of bernoulli
## Using 267 observations and 17 predictors
## creating 10 initial models of 50 trees
##
## folds are stratified by prevalence
## total mean deviance = 1.2274
## tolerance is fixed at 0.0012
## ntrees resid. dev.
## 50 1.2111
## now adding trees...
## 100 1.1969
## 150 1.1844
## 200 1.1736
## 250 1.1639
## 300 1.1556
## 350 1.148
## 400 1.1412
## 450 1.135
## 500 1.1297
## 550 1.1246
## 600 1.1196
## 650 1.1154
## 700 1.1119
## 750 1.1086
## 800 1.1058
## 850 1.1033
## 900 1.0999
## 950 1.0978
## 1000 1.0959
## 1050 1.0938
## 1100 1.0923
## 1150 1.0904
## 1200 1.0896
## 1250 1.0887
## 1300 1.0879
## 1350 1.087
## 1400 1.0865
## 1450 1.086
## 1500 1.0855
## 1550 1.0849
## 1600 1.0844
## 1650 1.0839
## 1700 1.0837
## 1750 1.0833
## 1800 1.0833
## 1850 1.0832
## 1900 1.0837
## 1950 1.0834
## 2000 1.0836
## 2050 1.0838
## 2100 1.0839
## 2150 1.0842
## 2200 1.0845
## 2250 1.0851
## 2300 1.0855
##
## mean total deviance = 1.227
## mean residual deviance = 0.858
##
## estimated cv deviance = 1.083 ; se = 0.023
##
## training data correlation = 0.671
## cv correlation = 0.398 ; se = 0.033
##
## training data AUC score = 0.901
## cv AUC score = 0.724 ; se = 0.024
##
## elapsed time - 0.44 minutes
## [1] 3
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for NA and using a family of bernoulli
## Using 267 observations and 17 predictors
## creating 10 initial models of 50 trees
##
## folds are stratified by prevalence
## total mean deviance = 1.2335
## tolerance is fixed at 0.0012
## ntrees resid. dev.
## 50 1.2126
## now adding trees...
## 100 1.1943
## 150 1.1781
## 200 1.163
## 250 1.1494
## 300 1.1365
## 350 1.1251
## 400 1.1144
## 450 1.1047
## 500 1.0952
## 550 1.0865
## 600 1.0783
## 650 1.0712
## 700 1.0649
## 750 1.0581
## 800 1.0528
## 850 1.0479
## 900 1.0436
## 950 1.0396
## 1000 1.0357
## 1050 1.0324
## 1100 1.029
## 1150 1.0259
## 1200 1.0234
## 1250 1.0209
## 1300 1.0186
## 1350 1.0171
## 1400 1.0154
## 1450 1.0137
## 1500 1.012
## 1550 1.0106
## 1600 1.0095
## 1650 1.0087
## 1700 1.0081
## 1750 1.0079
## 1800 1.0067
## 1850 1.0064
## 1900 1.0065
## 1950 1.0062
## 2000 1.0056
## 2050 1.0054
## 2100 1.0053
## 2150 1.0055
## 2200 1.0056
## 2250 1.0061
## 2300 1.0061
## 2350 1.0067
## 2400 1.007
## 2450 1.0079
## 2500 1.0085
##
## mean total deviance = 1.234
## mean residual deviance = 0.758
##
## estimated cv deviance = 1.005 ; se = 0.047
##
## training data correlation = 0.731
## cv correlation = 0.475 ; se = 0.05
##
## training data AUC score = 0.93
## cv AUC score = 0.778 ; se = 0.03
##
## elapsed time - 0.48 minutes
## [1] 4
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for NA and using a family of bernoulli
## Using 267 observations and 17 predictors
## creating 10 initial models of 50 trees
##
## folds are stratified by prevalence
## total mean deviance = 1.1946
## tolerance is fixed at 0.0012
## ntrees resid. dev.
## 50 1.1734
## now adding trees...
## 100 1.1551
## 150 1.1388
## 200 1.1237
## 250 1.1107
## 300 1.0991
## 350 1.0883
## 400 1.0787
## 450 1.0696
## 500 1.0616
## 550 1.054
## 600 1.0468
## 650 1.0401
## 700 1.0341
## 750 1.0283
## 800 1.0228
## 850 1.0176
## 900 1.0126
## 950 1.0079
## 1000 1.0036
## 1050 1
## 1100 0.9965
## 1150 0.9931
## 1200 0.9898
## 1250 0.9868
## 1300 0.9836
## 1350 0.9807
## 1400 0.9783
## 1450 0.9761
## 1500 0.9742
## 1550 0.9721
## 1600 0.9706
## 1650 0.9688
## 1700 0.967
## 1750 0.9658
## 1800 0.9647
## 1850 0.9635
## 1900 0.9625
## 1950 0.9613
## 2000 0.9602
## 2050 0.9594
## 2100 0.9585
## 2150 0.9578
## 2200 0.957
## 2250 0.9565
## 2300 0.9565
## 2350 0.9563
## 2400 0.9559
## 2450 0.9552
## 2500 0.9549
## 2550 0.9546
## 2600 0.9548
## 2650 0.9548
## 2700 0.9554
## 2750 0.9553
## 2800 0.9554
## 2850 0.9553
## 2900 0.9554
## 2950 0.9554
## 3000 0.9558
##
## mean total deviance = 1.195
## mean residual deviance = 0.667
##
## estimated cv deviance = 0.955 ; se = 0.033
##
## training data correlation = 0.765
## cv correlation = 0.509 ; se = 0.033
##
## training data AUC score = 0.941
## cv AUC score = 0.798 ; se = 0.021
##
## elapsed time - 0.59 minutes
##
## ########### warning ##########
##
## maximum tree limit reached - results may not be optimal
## - refit with faster learning rate or increase maximum number of trees
## [1] 5
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for NA and using a family of bernoulli
## Using 267 observations and 17 predictors
## creating 10 initial models of 50 trees
##
## folds are stratified by prevalence
## total mean deviance = 1.2335
## tolerance is fixed at 0.0012
## ntrees resid. dev.
## 50 1.2134
## now adding trees...
## 100 1.1956
## 150 1.1804
## 200 1.1666
## 250 1.1544
## 300 1.1438
## 350 1.1345
## 400 1.1259
## 450 1.1184
## 500 1.1118
## 550 1.1058
## 600 1.1006
## 650 1.0958
## 700 1.0913
## 750 1.0871
## 800 1.0832
## 850 1.08
## 900 1.0775
## 950 1.0747
## 1000 1.0721
## 1050 1.0701
## 1100 1.0681
## 1150 1.0666
## 1200 1.0651
## 1250 1.0634
## 1300 1.0622
## 1350 1.0614
## 1400 1.0601
## 1450 1.0594
## 1500 1.0592
## 1550 1.0583
## 1600 1.0573
## 1650 1.0571
## 1700 1.057
## 1750 1.0564
## 1800 1.0562
## 1850 1.0563
## 1900 1.0564
## 1950 1.0566
## 2000 1.0571
## 2050 1.0573
## 2100 1.0574
## 2150 1.0576
## 2200 1.0585
## 2250 1.0587
## 2300 1.0593
##
## mean total deviance = 1.234
## mean residual deviance = 0.833
##
## estimated cv deviance = 1.056 ; se = 0.037
##
## training data correlation = 0.685
## cv correlation = 0.44 ; se = 0.046
##
## training data AUC score = 0.913
## cv AUC score = 0.717 ; se = 0.034
##
## elapsed time - 0.44 minutes
## [1] 6
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for NA and using a family of bernoulli
## Using 267 observations and 17 predictors
## creating 10 initial models of 50 trees
##
## folds are stratified by prevalence
## total mean deviance = 1.2211
## tolerance is fixed at 0.0012
## ntrees resid. dev.
## 50 1.2042
## now adding trees...
## 100 1.189
## 150 1.1753
## 200 1.1624
## 250 1.1508
## 300 1.1399
## 350 1.1301
## 400 1.1209
## 450 1.1123
## 500 1.1043
## 550 1.0969
## 600 1.09
## 650 1.0834
## 700 1.0771
## 750 1.0718
## 800 1.0668
## 850 1.062
## 900 1.0574
## 950 1.0535
## 1000 1.05
## 1050 1.0466
## 1100 1.0433
## 1150 1.0401
## 1200 1.0372
## 1250 1.0347
## 1300 1.0326
## 1350 1.0303
## 1400 1.0281
## 1450 1.0257
## 1500 1.0245
## 1550 1.0232
## 1600 1.0219
## 1650 1.0207
## 1700 1.0196
## 1750 1.0184
## 1800 1.0172
## 1850 1.0163
## 1900 1.0159
## 1950 1.0152
## 2000 1.0145
## 2050 1.0143
## 2100 1.0138
## 2150 1.0137
## 2200 1.0136
## 2250 1.0133
## 2300 1.0134
## 2350 1.0136
## 2400 1.0137
## 2450 1.014
## 2500 1.0142
## 2550 1.0143
## 2600 1.0153
## 2650 1.0157
## 2700 1.0159
##
## mean total deviance = 1.221
## mean residual deviance = 0.749
##
## estimated cv deviance = 1.013 ; se = 0.048
##
## training data correlation = 0.732
## cv correlation = 0.462 ; se = 0.055
##
## training data AUC score = 0.928
## cv AUC score = 0.763 ; se = 0.026
##
## elapsed time - 0.52 minutes
## [1] 7
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for NA and using a family of bernoulli
## Using 267 observations and 17 predictors
## creating 10 initial models of 50 trees
##
## folds are stratified by prevalence
## total mean deviance = 1.2335
## tolerance is fixed at 0.0012
## ntrees resid. dev.
## 50 1.217
## now adding trees...
## 100 1.2024
## 150 1.1888
## 200 1.1772
## 250 1.1667
## 300 1.1563
## 350 1.1469
## 400 1.1385
## 450 1.1309
## 500 1.1237
## 550 1.1168
## 600 1.111
## 650 1.1058
## 700 1.1005
## 750 1.0957
## 800 1.0911
## 850 1.0866
## 900 1.083
## 950 1.0795
## 1000 1.0761
## 1050 1.0733
## 1100 1.0705
## 1150 1.0679
## 1200 1.0656
## 1250 1.0635
## 1300 1.0609
## 1350 1.0588
## 1400 1.0567
## 1450 1.0544
## 1500 1.0529
## 1550 1.0511
## 1600 1.0503
## 1650 1.0493
## 1700 1.0479
## 1750 1.0469
## 1800 1.0467
## 1850 1.0465
## 1900 1.0459
## 1950 1.0455
## 2000 1.0447
## 2050 1.0444
## 2100 1.0443
## 2150 1.0444
## 2200 1.0441
## 2250 1.044
## 2300 1.0446
## 2350 1.0443
## 2400 1.0449
## 2450 1.0447
## 2500 1.0444
## 2550 1.0447
## 2600 1.045
## 2650 1.0454
##
## mean total deviance = 1.234
## mean residual deviance = 0.779
##
## estimated cv deviance = 1.044 ; se = 0.053
##
## training data correlation = 0.716
## cv correlation = 0.443 ; se = 0.062
##
## training data AUC score = 0.917
## cv AUC score = 0.752 ; se = 0.039
##
## elapsed time - 0.51 minutes
## [1] 8
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for NA and using a family of bernoulli
## Using 267 observations and 17 predictors
## creating 10 initial models of 50 trees
##
## folds are stratified by prevalence
## total mean deviance = 1.2396
## tolerance is fixed at 0.0012
## ntrees resid. dev.
## 50 1.2206
## now adding trees...
## 100 1.2045
## 150 1.1903
## 200 1.1772
## 250 1.1655
## 300 1.1555
## 350 1.1466
## 400 1.1388
## 450 1.1323
## 500 1.1262
## 550 1.121
## 600 1.1166
## 650 1.1126
## 700 1.1088
## 750 1.1049
## 800 1.1017
## 850 1.0986
## 900 1.0958
## 950 1.0936
## 1000 1.0913
## 1050 1.0892
## 1100 1.0872
## 1150 1.0858
## 1200 1.0842
## 1250 1.0826
## 1300 1.0817
## 1350 1.0805
## 1400 1.08
## 1450 1.0795
## 1500 1.0788
## 1550 1.0781
## 1600 1.0779
## 1650 1.0776
## 1700 1.0771
## 1750 1.0771
## 1800 1.077
## 1850 1.0774
## 1900 1.0776
## 1950 1.0775
## 2000 1.0775
## 2050 1.0777
## 2100 1.0781
## 2150 1.0782
## 2200 1.0781
## 2250 1.0787
##
## mean total deviance = 1.24
## mean residual deviance = 0.853
##
## estimated cv deviance = 1.077 ; se = 0.059
##
## training data correlation = 0.674
## cv correlation = 0.403 ; se = 0.083
##
## training data AUC score = 0.896
## cv AUC score = 0.725 ; se = 0.047
##
## elapsed time - 0.43 minutes
## [1] 9
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for NA and using a family of bernoulli
## Using 267 observations and 17 predictors
## creating 10 initial models of 50 trees
##
## folds are stratified by prevalence
## total mean deviance = 1.2455
## tolerance is fixed at 0.0012
## ntrees resid. dev.
## 50 1.2281
## now adding trees...
## 100 1.2122
## 150 1.1983
## 200 1.1866
## 250 1.1755
## 300 1.1659
## 350 1.1571
## 400 1.1492
## 450 1.1424
## 500 1.1355
## 550 1.1296
## 600 1.1242
## 650 1.1194
## 700 1.1146
## 750 1.1105
## 800 1.107
## 850 1.1033
## 900 1.0996
## 950 1.0972
## 1000 1.0944
## 1050 1.0921
## 1100 1.0897
## 1150 1.088
## 1200 1.0865
## 1250 1.085
## 1300 1.0841
## 1350 1.0832
## 1400 1.0821
## 1450 1.0813
## 1500 1.0808
## 1550 1.0798
## 1600 1.0795
## 1650 1.0791
## 1700 1.0789
## 1750 1.0788
## 1800 1.079
## 1850 1.0794
## 1900 1.0796
## 1950 1.0799
## 2000 1.0802
## 2050 1.081
## 2100 1.0816
## 2150 1.0822
## 2200 1.0824
##
## mean total deviance = 1.245
## mean residual deviance = 0.852
##
## estimated cv deviance = 1.079 ; se = 0.052
##
## training data correlation = 0.682
## cv correlation = 0.416 ; se = 0.072
##
## training data AUC score = 0.897
## cv AUC score = 0.714 ; se = 0.04
##
## elapsed time - 0.42 minutes
## [1] 10
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for NA and using a family of bernoulli
## Using 267 observations and 17 predictors
## creating 10 initial models of 50 trees
##
## folds are stratified by prevalence
## total mean deviance = 1.2274
## tolerance is fixed at 0.0012
## ntrees resid. dev.
## 50 1.2034
## now adding trees...
## 100 1.1822
## 150 1.1637
## 200 1.1478
## 250 1.1328
## 300 1.1192
## 350 1.1068
## 400 1.0954
## 450 1.0852
## 500 1.0759
## 550 1.0678
## 600 1.0598
## 650 1.0524
## 700 1.0459
## 750 1.0404
## 800 1.0358
## 850 1.0311
## 900 1.0271
## 950 1.0237
## 1000 1.0207
## 1050 1.0174
## 1100 1.015
## 1150 1.0126
## 1200 1.0103
## 1250 1.0083
## 1300 1.0063
## 1350 1.0048
## 1400 1.0033
## 1450 1.0023
## 1500 1.0011
## 1550 0.9993
## 1600 0.9988
## 1650 0.9981
## 1700 0.9976
## 1750 0.9973
## 1800 0.9966
## 1850 0.9965
## 1900 0.9962
## 1950 0.9958
## 2000 0.9957
## 2050 0.9959
## 2100 0.9959
## 2150 0.9962
## 2200 0.9966
## 2250 0.997
## 2300 0.9971
## 2350 0.997
## 2400 0.9974
## 2450 0.9981
##
## mean total deviance = 1.227
## mean residual deviance = 0.762
##
## estimated cv deviance = 0.996 ; se = 0.065
##
## training data correlation = 0.719
## cv correlation = 0.486 ; se = 0.076
##
## training data AUC score = 0.939
## cv AUC score = 0.768 ; se = 0.039
##
## elapsed time - 0.39 minutes
# AUC values through the various iterations.
compared_model_halc
## halc_auc_glm halc_auc_gam halc_auc_gbm1 halc_auc_gbm2
## 1 0.7755991 0.7705156 0.7418301 0.7447349
## 2 0.8534738 0.8534738 0.8100517 0.8215078
## 3 0.7142319 0.7142319 0.7164910 0.7198795
## 4 0.7127478 0.7127478 0.6866029 0.6746411
## 5 0.7863328 0.7855798 0.7865211 0.8045934
## 6 0.7516340 0.7523602 0.7734205 0.7799564
## 7 0.7432229 0.7443524 0.7951807 0.7725904
## 8 0.7991551 0.7953149 0.8206605 0.8294931
## 9 0.7668627 0.7625490 0.7917647 0.7854902
## 10 0.7390983 0.7431633 0.6906874 0.7250554
#The mean AUC values for the various models
mean_auc_halc<-colMeans(compared_model_halc)
# attach(compared_model_halc)
#This was used temporarily to see if other models are significantly better.
# wilcox.test(halc_auc_gbm1, halc_auc_gam, paird=T)
#let's see the mean auc values for all the models
mean_auc_halc
## halc_auc_glm halc_auc_gam halc_auc_gbm1 halc_auc_gbm2
## 0.7642358 0.7634289 0.7613211 0.7657942
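The commented-out `wilcox.test` above was used to check whether one model's per-fold AUCs are significantly better than another's. As a self-contained sketch of that check (with toy AUC values, not the results reported here), a paired Wilcoxon signed-rank test can be run like this:

```r
# Paired Wilcoxon signed-rank test on per-fold AUCs (toy values, not the
# assignment's results); exact = FALSE avoids tie warnings on small samples
auc_model_a <- c(0.78, 0.85, 0.71, 0.71, 0.79, 0.75, 0.74, 0.80, 0.77, 0.74)
auc_model_b <- c(0.74, 0.82, 0.72, 0.67, 0.80, 0.78, 0.77, 0.83, 0.79, 0.73)
wt <- wilcox.test(auc_model_a, auc_model_b, paired = TRUE, exact = FALSE)
wt$p.value  # a large p-value means no significant difference in AUC
```

A p-value above 0.05 here would suggest the two models perform comparably, which is consistent with the very similar mean AUC values in the table above.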
ROC (AUC) curves of the various models.
#show the AUC curves from the last iteration on the test data
par(mfrow=c(2,2))
colAUC(halc_glm_pred, eva$high_use , plotROC = T)
## [,1]
## FALSE vs. TRUE 0.7390983
colAUC(pred_halc_gam, eva$high_use , plotROC = T)
## [,1]
## FALSE vs. TRUE 0.7431633
colAUC(pred_halc_gbm1, eva$high_use , plotROC = T)
## [,1]
## FALSE vs. TRUE 0.6906874
colAUC(pred_halc_gbm2, eva$high_use , plotROC = T)
## [,1]
## FALSE vs. TRUE 0.7250554
par(mfrow=c(1,1))
Here, I used the Area Under the ROC Curve (AUC) to evaluate my models, because it avoids the subjectivity of selecting a single threshold, which can significantly affect predictive performance and the balance of commission and omission errors. Although measures such as prevalence have been recommended for choosing the threshold used to convert predicted probabilities into binary classes (e.g. TRUE or FALSE), AUC takes care of this by considering all possible thresholds and evaluating performance accordingly. As a rule of thumb, an AUC above 0.9 indicates an excellent model, while values in the ranges 0.7-0.8 and 0.5-0.7, and values near 0.5, are fairly good, poor and very poor, respectively. Here, the mean AUC values across my boosting and resampling iterations are around 0.76 for all of the model types (GLM, GAM and GBM) that I applied.
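To make the threshold-free idea concrete, AUC can also be computed directly from ranks (the Mann-Whitney formulation) without ever choosing a cut-off. This is my own illustrative sketch with toy labels and scores, not part of the assignment's pipeline:

```r
# Rank-based AUC: the probability that a randomly chosen positive case
# receives a higher predicted score than a randomly chosen negative case
auc_rank <- function(labels, scores) {
  r <- rank(scores)                          # ranks over all predictions
  n_pos <- sum(labels)
  n_neg <- sum(!labels)
  (sum(r[labels]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(TRUE, FALSE, TRUE, FALSE, FALSE)   # toy observed classes
scores <- c(0.90, 0.30, 0.35, 0.40, 0.20)      # toy predicted probabilities
auc_rank(labels, scores)   # 5 of the 6 positive-negative pairs ranked correctly: 5/6
```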
####################################################
#Model for alcohol use
#RESPONSE CURVES AND MODEL INTERPRETATIONS.
#Here, I decided to use the entire data for the analysis
#GAM
halc_gam<-gam(high_use~ sex+ s(studytime, k=3) + s(goout, k=3) +
s(absences, k=3) , data =data_reg, family = "binomial")
summary(halc_gam)
##
## Family: binomial
## Link function: logit
##
## Formula:
## high_use ~ sex + s(studytime, k = 3) + s(goout, k = 3) + s(absences,
## k = 3)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.4490 0.1983 -7.308 2.72e-13 ***
## sexM 0.7817 0.2656 2.944 0.00324 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(studytime) 1 1 6.095 0.0136 *
## s(goout) 1 1 36.715 1.37e-09 ***
## s(absences) 1 1 12.118 0.0005 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.227 Deviance explained = 18.1%
## UBRE = 0.024361 Scale est. = 1 n = 382
#plot the response curves from GAM
plot(halc_gam, pages=1)
From the above, we can see the response of high alcohol use to the various predictors, together with the confidence intervals. There appears to be a lower tendency of high alcohol use as study time increases. This is expected, as such students have less time for social activities and parties. On the other hand, going out, as expected, increases the tendency of high alcohol use. In a similar vein, absence from school increases the tendency too. However, as can be seen from the plot, the confidence band widens as the number of absences increases, which might be an indication of insufficient data in that region.
To get deeper insight, I will explore this further with GBM.
#GBM
halc_gbm1<-gbm(formula = high_use~ sex + age+address+Medu+Fedu+
Pstatus+ traveltime+studytime+famsup+activities+higher+
internet+famrel+romantic+freetime+goout+ absences, data=data_reg,
distribution = "bernoulli",n.trees = 2800, shrinkage = 0.001, interaction.depth = 6,
bag.fraction = 0.75)
summary(halc_gbm1)
## var rel.inf
## goout goout 29.7986228
## absences absences 18.6109317
## sex sex 13.6762991
## famrel famrel 7.6715637
## traveltime traveltime 6.7131311
## Medu Medu 4.0753433
## studytime studytime 3.6682986
## freetime freetime 3.3699362
## age age 2.6712668
## activities activities 2.5153293
## Fedu Fedu 2.2009417
## address address 1.9756887
## famsup famsup 1.4060125
## romantic romantic 1.0463254
## internet internet 0.3687841
## Pstatus Pstatus 0.1251636
## higher higher 0.1063612
From the relative importance scores, we can see that going out (goout), absences, sex and family relationship (famrel) have the strongest effects on high alcohol use. The least important factors can also be read from the model summary.
Now, I'll show the response curves from GBM to see the effects.
#Now, i'll show the response curve from GBM to see the effects
#plot(halc_gbm1)
best.iter1<-gbm.perf(halc_gbm1, plot.it = F, method = "OOB")
# pred_halc_gbm1<-predict.gbm(halc_gbm1,newdata = eva, best.iter1, type = "response")
par(mfrow=c(2,2))
plot.gbm(halc_gbm1, "goout", best.iter1)
plot.gbm(halc_gbm1, "absences", best.iter1)
plot.gbm(halc_gbm1, "sex", best.iter1)
plot.gbm(halc_gbm1, "famrel", best.iter1)
par(mfrow=c(1,1))
plot.gbm(halc_gbm1, "studytime", best.iter1)
As can also be seen from the GBM response curves, absences and going out increase the tendency of high alcohol use. Going out has a steep effect once it rises above average, and slightly more than 10 absences can already increase the tendency. Better family relationships, however, reduce it, and overall, male students show relatively more high alcohol use than their female counterparts. Also, as shown earlier, more study time appears to reduce the tendency of high alcohol use.
#explore the odds ratios
halc_glm_mod <- glm(high_use~ sex + studytime + goout + absences, data=data_reg, family = "binomial")
odds_ra <- exp(coef(halc_glm_mod))
#odds_ra <- coef(high_use_mod2) %>% exp #alternaive
# compute confidence intervals (conf_int)
conf_int <- exp(confint(halc_glm_mod))
#Conf_Int <- high_use_mod2 %>% confint() %>% exp #alternative
# print out the odds ratios with their confidence intervals
cbind(odds_ra, conf_int)
## odds_ra 2.5 % 97.5 %
## (Intercept) 0.04056573 0.01222314 0.1263579
## sexM 2.18525582 1.30370562 3.7010763
## studytime 0.65628388 0.46510383 0.9099505
## goout 2.06838526 1.64470314 2.6349716
## absences 1.08123884 1.03577383 1.1324143
Here, we can also see that the odds ratios for male students, absences and going out are greater than 1, indicating that they all increase the tendency of high alcohol consumption by students. This is not surprising: absence from school and going out would normally be expected to go together with high alcohol use. The confidence intervals, which exclude 1, show that it is highly likely that these factors have an effect. On the other hand, the odds ratio for study time is below 1, which is also not surprising, because students who study more would be expected to have less time for going out and getting drunk. These results are all in accordance with the results of the models used earlier.
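Because odds ratios act multiplicatively on the odds, per-unit ratios compound over several units. A rough illustration, using the per-absence odds ratio of about 1.08 reported above:

```r
# Odds ratios multiply: k extra units scale the odds by OR^k
or_absences <- 1.08   # approximate per-absence odds ratio from the model above
or_absences^10        # ten extra absences roughly double the odds (~2.16x)
```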
ERRORS IN PREDICTED HIGH ALCOHOL USE. Although AUC is preferred, as it considers all possible thresholds and avoids subjectivity, I will also explore a confusion matrix, using 0.5 as the threshold for TRUE or FALSE high alcohol use. More information on creating functions to calculate these metrics in a more automated way can be found here.
#fit the model
# predict() the probability of high_use
probs<- predict(halc_glm_mod, type = "response")
#Copy the data again
alc<-data_reg
#add the predicted probabilities to 'alc'
alc$prob_high_use <- probs
#alc <- mutate(alc, probability = probabilities)
#use the probabilities to make a prediction of high_use, setting 0.5 as threshold
alc$predict_high_use<- (alc$prob_high_use)>0.5
#alc <- mutate(alc, prediction = prob_high_use>0.5)
#see the first ten and last ten original classes, predicted probabilities, and class predictions
head(alc[,c("failures", "absences", "sex", "high_use", "prob_high_use", "predict_high_use")], n=10)
## failures absences sex high_use prob_high_use predict_high_use
## 1 0 3 F FALSE 0.03910480 FALSE
## 2 1 2 F TRUE 0.27212509 FALSE
## 3 0 8 F FALSE 0.09326837 FALSE
## 4 0 2 F FALSE 0.05424023 FALSE
## 5 1 5 F TRUE 0.03386206 FALSE
## 6 0 2 F FALSE 0.05424023 FALSE
## 7 1 0 F FALSE 0.04676258 FALSE
## 8 0 1 F FALSE 0.14322690 FALSE
## 9 0 9 F TRUE 0.06105537 FALSE
## 10 0 10 F FALSE 0.12696107 FALSE
#tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$predict_high_use)
## prediction
## high_use FALSE TRUE
## FALSE 252 16
## TRUE 65 49
#Here, I derive the various classification-matrix metrics
#Accuracy: Overall, how often is the classifier correct?
#(TP+TN)/total
accuracy<- (252 + 49)/nrow(alc)
print(paste("The accuracy of the classification is", accuracy))
## [1] "The accuracy of the classification is 0.787958115183246"
#Error rate
print(paste("The error rate/misclassification is", 1-accuracy))
## [1] "The error rate/misclassification is 0.212041884816754"
#True Positive Rate:
#"Sensitivity" or "Recall": When actually yes, how often does it predict yes?
#(TP/actual yes)
print(paste("The True positive rate/Sensitivity is", 49/(65+49)))
## [1] "The True positive rate/Sensitivity is 0.429824561403509"
#False Positive Rate: i.e When actually no, how often does it predict yes?
print(paste("The false positive rate is", 16/(16+252)))
## [1] "The false positive rate is 0.0597014925373134"
#Specificity: When actually no, how often does it predict no?
#TN/actual no
#Alternatively: 1 minus False Positive Rate
print(paste("The Specificity is", 252/(252+16)))
## [1] "The Specificity is 0.940298507462687"
#Precision: When it predicts yes, how often is it correct?
#TP/predicted yes
print(paste("The precision is", 49/(16+49)))
## [1] "The precision is 0.753846153846154"
#Prevalence: How often does the yes condition actually occur in our sample?
#actual yes/total
print(paste("The prevalence is", (65+49)/nrow(alc)))
## [1] "The prevalence is 0.298429319371728"
The accuracy of the prediction is about 0.79, which is fairly good, and the misclassification/error rate is about 0.21. Using a threshold of 0.5, 252 cases of no high alcohol use (FALSE) were correctly predicted while 16 were falsely predicted as high use. Similarly, 49 cases of high alcohol use were correctly predicted while 65 were missed. The prediction is therefore better at identifying no high alcohol use (specificity) than at identifying actual high use (sensitivity). This can also be seen in the plot below.
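The metrics computed by hand above can also be wrapped in a small helper function. This is my own sketch, in the spirit of the automated-metrics suggestion earlier, not code from the course material:

```r
# Derive the usual confusion-matrix metrics from two logical vectors
class_metrics <- function(actual, predicted) {
  tp <- sum(actual & predicted);   tn <- sum(!actual & !predicted)
  fp <- sum(!actual & predicted);  fn <- sum(actual & !predicted)
  c(accuracy    = (tp + tn) / length(actual),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    precision   = tp / (tp + fp))
}

# toy check with five observations
act  <- c(TRUE, TRUE, FALSE, FALSE, FALSE)
pred <- c(TRUE, FALSE, FALSE, FALSE, TRUE)
class_metrics(act, pred)   # accuracy 0.6, sensitivity 0.5, specificity ~0.67, precision 0.5
```

Called as `class_metrics(alc$high_use, alc$predict_high_use)`, it should reproduce the numbers printed above.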
Plotting the error of prediction.
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = prob_high_use, y = high_use, col= predict_high_use))
# define the geom as points and draw the plot
g + geom_point()
ALTERNATIVE FOR CALCULATING THE ERROR
#Proportion of errors
conf_mat<-table(high_use = alc$high_use, prediction = alc$predict_high_use)
conf_mat<-prop.table(conf_mat)
addmargins(conf_mat)
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.65968586 0.04188482 0.70157068
## TRUE 0.17015707 0.12827225 0.29842932
## Sum 0.82984293 0.17015707 1.00000000
#mean error of the prediction
mean(abs(alc$high_use-alc$prob_high_use)>0.5)
## [1] 0.2120419
#The below is an alternative way by firstly defining the function to calculate the mean error.
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$prob_high_use)
## [1] 0.2120419
K-fold cross-validation
#K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = halc_glm_mod, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2094241
Here I used an arbitrary threshold of 0.5, but it has been suggested that prevalence can be used to select the threshold. You can see more here about the terminology related to the confusion/classification matrix.
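A minimal sketch of that prevalence-based alternative (toy data, not the alc dataset): the threshold is simply set to the observed share of positive cases.

```r
# Prevalence-based threshold instead of a fixed 0.5 cut-off
actual <- c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)   # toy observed classes
probs  <- c(0.60, 0.20, 0.35, 0.45, 0.10, 0.30)       # toy predicted probabilities
thr  <- mean(actual)        # prevalence: 2/6, so about 0.33
pred <- probs > thr
table(actual, pred)
```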
###############################################################
#LINEAR DISCRIMINANT ANALYSIS.
#copy data again
data_copy2<- alco_data
#filter the numeric variables.
data_num<-Filter(is.numeric, data_copy2)
#Scale the numeric variables
data_num_sca<- data.frame( scale(data_num))
#get the categorised grades created earlier into the vector
G3_cat<-data_mca_fac2[,c("G3")]
#add to to the scaled data frame
data_num_G3<- cbind(data_num_sca, G3_cat)
#summary
summary(data_num_G3)
## age Medu Fedu traveltime
## Min. :-1.3519 Min. :-2.5831 Min. :-2.3402 Min. :-0.6434
## 1st Qu.:-0.4997 1st Qu.:-0.7422 1st Qu.:-0.5158 1st Qu.:-0.6434
## Median : 0.3525 Median : 0.1783 Median : 0.3964 Median :-0.6434
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3525 3rd Qu.: 1.0988 3rd Qu.: 1.3086 3rd Qu.: 0.7939
## Max. : 4.6133 Max. : 1.0988 Max. : 1.3086 Max. : 3.6683
## studytime failures famrel freetime
## Min. :-1.23721 Min. :-0.3458 Min. :-3.24319 Min. :-2.2541
## 1st Qu.:-1.23721 1st Qu.:-0.3458 1st Qu.: 0.06937 1st Qu.:-0.2233
## Median :-0.04374 Median :-0.3458 Median : 0.06937 Median :-0.2233
## Mean : 0.00000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.:-0.04374 3rd Qu.:-0.3458 3rd Qu.: 1.17356 3rd Qu.: 0.7921
## Max. : 2.34320 Max. : 4.8004 Max. : 1.17356 Max. : 1.8075
## goout Dalc Walc health
## Min. :-1.8897 Min. :-0.5434 Min. :-1.0162 Min. :-1.8397
## 1st Qu.:-0.9952 1st Qu.:-0.5434 1st Qu.:-1.0162 1st Qu.:-0.4099
## Median :-0.1007 Median :-0.5434 Median :-0.2320 Median : 0.3051
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7938 3rd Qu.: 0.5847 3rd Qu.: 0.5522 3rd Qu.: 1.0200
## Max. : 1.6883 Max. : 3.9691 Max. : 2.1206 Max. : 1.0200
## absences G1 G2 G3
## Min. :-0.8238 Min. :-3.5147 Min. :-2.6409 Min. :-3.4614
## 1st Qu.:-0.6407 1st Qu.:-0.5517 1st Qu.:-0.5185 1st Qu.:-0.4405
## Median :-0.2746 Median : 0.1891 Median : 0.1889 Median : 0.1637
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.2746 3rd Qu.: 0.9298 3rd Qu.: 0.8963 3rd Qu.: 0.7679
## Max. : 7.4139 Max. : 2.4113 Max. : 2.3112 Max. : 1.9763
## alc_use G3_cat
## Min. :-0.9024 vlow :143
## 1st Qu.:-0.9024 low :107
## Median :-0.3947 high : 64
## Mean : 0.0000 vhigh: 68
## 3rd Qu.: 0.6207
## Max. : 3.1592
DIVIDE DATA INTO TEST AND TRAIN DATA FOR VALIDATION.
#now, divide into 80:20 train:test data
samp <- sample(1:nrow(data_num_G3), size = nrow(data_num_G3)*0.8)
train <- data_num_G3[samp,]
test <- data_num_G3[-samp,]
#Dimension.
dim(train)
## [1] 305 18
#remove the grades columns
train$G1<-train$G2<-train$G3<-NULL
#check the column names now
colnames(train)
## [1] "age" "Medu" "Fedu" "traveltime" "studytime"
## [6] "failures" "famrel" "freetime" "goout" "Dalc"
## [11] "Walc" "health" "absences" "alc_use" "G3_cat"
Fitting the categorised grade to all other variables.
#fit the categorised grade to all other variables
lda.fit <- lda(G3_cat~., data = train)
#lda.fit
#create arrow function
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
#convert the categorised grade into numeric for the LDA plot
train$G3_cat <- as.numeric(train$G3_cat)
#plot
plot(lda.fit, dimen = 2, col = train$G3, pch= train$G3)
lda.arrows(lda.fit, myscale = 2)
Past failures appear to contribute most to poor grades, and absences also seem to play a role. Mother's education level seems to be a factor contributing to students' success.
LDA CLASSIFICATION ACCURACY
#see the class prediction accuracy
G3_cat<-test$G3_cat
test<-dplyr::select(test, -G3_cat)
lda.pred<-predict(lda.fit, newdata = test)
table(correct = G3_cat, predicted = lda.pred$class)
## predicted
## correct vlow low high vhigh
## vlow 16 9 0 2
## low 14 9 0 2
## high 7 4 4 1
## vhigh 5 1 0 3
The table above shows the correct classifications and the misclassifications.
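The overall classification accuracy can be read off that table as the share of counts on the diagonal. A quick sketch using the counts printed above:

```r
# LDA confusion table from above; rows = correct class, columns = predicted
conf <- matrix(c(16, 9, 0, 2,
                 14, 9, 0, 2,
                  7, 4, 4, 1,
                  5, 1, 0, 3),
               nrow = 4, byrow = TRUE,
               dimnames = list(c("vlow", "low", "high", "vhigh"),
                               c("vlow", "low", "high", "vhigh")))
sum(diag(conf)) / sum(conf)   # about 0.42: 32 of the 77 test cases classified correctly
```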
Distance between observations.
#Now, see the Euclidean distance between the scaled observations.
dist_sca<-dist(data_num_sca)
summary(dist_sca)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5492 4.4994 5.4263 5.6181 6.5457 13.4290
Next, cluster the data using k-means clustering and refit the LDA on the clustered data.
First, determine the optimal number of clusters using the elbow method on the total within-cluster sum of squares (TWCSS).
set.seed(123) #This is used to make the values constant when re-run
# determine the number of clusters
k_max <- 20
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(data_num_sca, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = c('point', 'line'))
############################################################
#some other methods of exploring optimal number of classes
# library(cluster)
# library(factoextra)
# # reVisualize k-means clusters
# km.res <- kmeans(data_num_sca, 3, nstart = 25)
# fviz_cluster(km.res, data = data_num_sca, geom = "point",
# stand = FALSE, frame.type = "norm")
#
# #install.packages("factoextra")
# require(cluster)
# fviz_nbclust(data_num_sca, kmeans, method = "silhouette")
#############################################################
Further, use NbClust to confirm. There are quite a few other approaches too; two of them are in the previous code chunk, where the packages "cluster" and "factoextra" can be used. See more here about other methods of checking the optimal number of clusters.
#install.packages("cluster")
library(NbClust)
nb <- NbClust(data_num_sca, diss=NULL, distance = "euclidean",
min.nc=2, max.nc=5, method = "kmeans",
index = "all", alphaBeale = 0.1)
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 8 proposed 2 as the best number of clusters
## * 13 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
#Based on these, I will set the number of clusters to 3
# k-means clustering
#use the euclidean distance calculated earlier.
km <-kmeans(dist_sca, centers = 3)
#perform LDA on the clustered data
lda.fit_clus<- lda(km$cluster~., data=data_num_sca)
plot(lda.fit_clus, dimen = 2, col = km$cluster, pch= km$cluster)
lda.arrows(lda.fit_clus, myscale = 2)
It seems that past failures and alcohol use also contribute to poor performance.
#copy the data again
data_copy2<-alco_data
data_num2<-Filter(is.numeric, data_copy2)
data_LDA1_sca<- data.frame( scale(data_num2))
summary(data_LDA1_sca)
## age Medu Fedu traveltime
## Min. :-1.3519 Min. :-2.5831 Min. :-2.3402 Min. :-0.6434
## 1st Qu.:-0.4997 1st Qu.:-0.7422 1st Qu.:-0.5158 1st Qu.:-0.6434
## Median : 0.3525 Median : 0.1783 Median : 0.3964 Median :-0.6434
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3525 3rd Qu.: 1.0988 3rd Qu.: 1.3086 3rd Qu.: 0.7939
## Max. : 4.6133 Max. : 1.0988 Max. : 1.3086 Max. : 3.6683
## studytime failures famrel freetime
## Min. :-1.23721 Min. :-0.3458 Min. :-3.24319 Min. :-2.2541
## 1st Qu.:-1.23721 1st Qu.:-0.3458 1st Qu.: 0.06937 1st Qu.:-0.2233
## Median :-0.04374 Median :-0.3458 Median : 0.06937 Median :-0.2233
## Mean : 0.00000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.:-0.04374 3rd Qu.:-0.3458 3rd Qu.: 1.17356 3rd Qu.: 0.7921
## Max. : 2.34320 Max. : 4.8004 Max. : 1.17356 Max. : 1.8075
## goout Dalc Walc health
## Min. :-1.8897 Min. :-0.5434 Min. :-1.0162 Min. :-1.8397
## 1st Qu.:-0.9952 1st Qu.:-0.5434 1st Qu.:-1.0162 1st Qu.:-0.4099
## Median :-0.1007 Median :-0.5434 Median :-0.2320 Median : 0.3051
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7938 3rd Qu.: 0.5847 3rd Qu.: 0.5522 3rd Qu.: 1.0200
## Max. : 1.6883 Max. : 3.9691 Max. : 2.1206 Max. : 1.0200
## absences G1 G2 G3
## Min. :-0.8238 Min. :-3.5147 Min. :-2.6409 Min. :-3.4614
## 1st Qu.:-0.6407 1st Qu.:-0.5517 1st Qu.:-0.5185 1st Qu.:-0.4405
## Median :-0.2746 Median : 0.1891 Median : 0.1889 Median : 0.1637
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.2746 3rd Qu.: 0.9298 3rd Qu.: 0.8963 3rd Qu.: 0.7679
## Max. : 7.4139 Max. : 2.4113 Max. : 2.3112 Max. : 1.9763
## alc_use
## Min. :-0.9024
## 1st Qu.:-0.9024
## Median :-0.3947
## Mean : 0.0000
## 3rd Qu.: 0.6207
## Max. : 3.1592
#remove the alcohol columns
data_LDA1_sca$alc_use<-data_LDA1_sca$Walc<-data_LDA1_sca$Dalc<-NULL
#see the column names
colnames(data_LDA1_sca)
## [1] "age" "Medu" "Fedu" "traveltime" "studytime"
## [6] "failures" "famrel" "freetime" "goout" "health"
## [11] "absences" "G1" "G2" "G3"
#Get the alcohol use into a vector
alc_use<-alco_data[,c("alc_use")]
#combine with the scaled data
data_LDA2<- cbind(data_LDA1_sca, alc_use)
#summary(data_LDA2)
lda.fit2 <- lda(alc_use~., data = data_LDA2)
#lda.fit
#first, determine the optimal number of clusters using twcss.
set.seed(123)
# determine the number of clusters
k_max <- 10
alc_sca<-as.data.frame(scale(data_LDA2))
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(alc_sca, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
dist_alc<- dist(alc_sca)
km <- kmeans(dist_alc, centers = 4) #use the alcohol distance matrix computed above
#perform LDA on the clustered data
lda.fit_clus<- lda(km$cluster~., data=alc_sca)
lda.fit_clus
## Call:
## lda(km$cluster ~ ., data = alc_sca)
##
## Prior probabilities of groups:
## 1 2 3 4
## 0.24345550 0.23036649 0.09162304 0.43455497
##
## Group means:
## age Medu Fedu traveltime studytime failures
## 1 -0.15150389 0.47524588 0.31793533 -0.2879191 0.34125036 -0.3273272
## 2 0.09100733 -0.26100930 -0.13226160 0.2385657 -0.27429626 0.2974988
## 3 0.64464336 -0.40027789 -0.56793020 0.9170400 -0.58932564 1.5656617
## 4 -0.09928495 -0.04348989 0.01173851 -0.1585163 0.07848304 -0.3044375
## famrel freetime goout health absences G1
## 1 0.06937305 -0.037671026 -0.38924340 -0.20230345 -0.2411265 1.2245145
## 2 -0.33214977 0.007491072 0.22458595 0.11007841 0.2933109 -0.6316473
## 3 -0.11991628 0.327936363 0.56380349 0.08036769 0.8028458 -0.8691426
## 4 0.16249732 -0.052009528 -0.01986174 0.03803886 -0.1896759 -0.1679210
## G2 G3 alc_use
## 1 1.2196444 1.15119286 -0.4930008
## 2 -0.6511951 -0.66705988 0.4591347
## 3 -0.9430175 -0.98425931 1.4475137
## 4 -0.1392539 -0.08379874 -0.2723962
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## age -0.07064798 -0.10742109 -0.154803905
## Medu -0.21614373 -0.28946292 -0.292725530
## Fedu -0.04019076 0.11655127 0.358301628
## traveltime 0.30661476 -0.31082720 -0.012004516
## studytime -0.03182434 -0.07466985 -0.124071455
## failures 0.38617707 -0.87803992 -0.248719853
## famrel -0.08948820 -0.04204097 -0.813033662
## freetime -0.03045788 -0.13204798 -0.033243313
## goout 0.08390191 0.25133500 0.003604871
## health -0.02027092 0.12273177 0.098613125
## absences 0.37848441 -0.27521538 0.221078909
## G1 -0.36587629 -0.65489022 0.378096333
## G2 -0.25588535 -0.69491291 0.318987655
## G3 -0.63484376 0.34177356 -0.616701007
## alc_use 0.85244671 -0.37363870 -0.003084369
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.8474 0.1444 0.0083
#plot
plot(lda.fit_clus, dimen = 2, col = km$cluster, pch= km$cluster)
lda.arrows(lda.fit_clus, myscale = 2)
Here, we can see that going out, free time and absences tend to go with high alcohol use. On the other hand, study time and family relationship tend to reduce it.
3D VISUALISATION OF THE CLUSTERS
model_predictors <- dplyr::select(data_LDA2, -alc_use)
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit2$scaling #use the alc_use LDA fit
matrix_product <- as.data.frame(matrix_product)
#Next, install and access the plotly package. Create a 3D plot (Cool!) of the columns of the matrix product by typing the code below.
library(plotly)
#using the plotly package for 3D plotting of the matrix products' columns.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=data_LDA2$alc_use)